SwarmDataManager

Manages Swarm satellite data download, storage, and retrieval operations.

This class provides a high-level interface for working with Swarm data independently from the main MagGeo pipeline.

Functions

`__init__(data_dir='swarm_data', file_format='csv', chunk_size=10, token=None)`

Initialize SwarmDataManager.

Parameters

data_dir : str, default "swarm_data"
    Directory to store downloaded Swarm data.
file_format : str, default "csv"
    File format for saving data. Options: "csv", "parquet".
chunk_size : int, default 10
    Number of dates to process in each batch.
token : str, optional
    VirES token for authentication.
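To illustrate what chunk_size controls, the sketch below splits a list of dates into batches of at most chunk_size entries. The helper name chunk_dates is hypothetical and not part of MagGeo; how the manager batches its requests internally may differ.

```python
import datetime as dt
from typing import Iterator, List


def chunk_dates(dates: List[dt.date], chunk_size: int = 10) -> Iterator[List[dt.date]]:
    """Yield consecutive batches of at most `chunk_size` dates (hypothetical helper)."""
    if chunk_size < 1:
        raise ValueError("chunk_size must be a positive integer")
    for start in range(0, len(dates), chunk_size):
        yield dates[start:start + chunk_size]


# 25 consecutive days split with the default chunk_size of 10
days = [dt.date(2020, 1, 1) + dt.timedelta(days=i) for i in range(25)]
batches = list(chunk_dates(days, chunk_size=10))
print([len(batch) for batch in batches])  # [10, 10, 5]
```

A smaller chunk_size means more, shorter batches; a larger one means fewer batches that each cover more days.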

`download_for_trajectory(gps_df, save_individual_files=True, save_concatenated=True, resume=True)`

Download Swarm data for an entire GPS trajectory.

Parameters

gps_df : pd.DataFrame
    GPS trajectory data with datetime information.
save_individual_files : bool, default True
    Whether to save individual daily files.
save_concatenated : bool, default True
    Whether to save concatenated files for each satellite.
resume : bool, default True
    Whether to skip already downloaded files.

Returns

tuple
    Tuple of concatenated DataFrames for satellites A, B, C.
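Because data is stored in daily files, a trajectory download presumably starts by working out which calendar days the GPS fixes cover. The sketch below shows that reduction with the standard library only. The helper unique_dates is hypothetical, and whether the manager also requests neighbouring days is not stated here.

```python
import datetime as dt
from typing import Iterable, List


def unique_dates(timestamps: Iterable[dt.datetime]) -> List[dt.date]:
    """Return the sorted calendar dates covered by the timestamps (hypothetical helper)."""
    return sorted({ts.date() for ts in timestamps})


# Four GPS fixes spread over three distinct days
fixes = [
    dt.datetime.fromisoformat("2020-01-01 23:50:00"),
    dt.datetime.fromisoformat("2020-01-02 00:10:00"),
    dt.datetime.fromisoformat("2020-01-02 12:00:00"),
    dt.datetime.fromisoformat("2020-01-04 08:30:00"),
]
dates = unique_dates(fixes)
print(dates)
```

With a pandas DataFrame, the same reduction can be done on a parsed datetime column through its `.dt.date` accessor.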

`download_for_dates(dates, save_individual_files=True, save_concatenated=True, resume=True)`

Download Swarm data for specific dates.

Parameters

dates : List[dt.date]
    List of dates to download data for.
save_individual_files : bool, default True
    Whether to save individual daily files.
save_concatenated : bool, default True
    Whether to save concatenated files for each satellite.
resume : bool, default True
    Whether to skip already downloaded files.

Returns

tuple
    Tuple of concatenated DataFrames for satellites A, B, C.
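The resume flag can be pictured as a filter over the requested dates. The following is a minimal sketch of that idea, not the manager's actual implementation: pending_dates and the is_downloaded callback are hypothetical names introduced for illustration.

```python
import datetime as dt
from typing import Callable, Iterable, List


def pending_dates(
    dates: Iterable[dt.date],
    is_downloaded: Callable[[dt.date], bool],
    resume: bool = True,
) -> List[dt.date]:
    """Return the dates that still need a download (hypothetical helper).

    With resume=False every date is returned, so existing files would be fetched again.
    """
    dates = list(dates)
    if not resume:
        return dates
    return [d for d in dates if not is_downloaded(d)]


requested = [dt.date(2020, 1, day) for day in (1, 2, 3, 4)]
already_on_disk = {dt.date(2020, 1, 1), dt.date(2020, 1, 3)}

# Only the dates missing from disk remain
todo = pending_dates(requested, is_downloaded=already_on_disk.__contains__)
print(todo)
```

In practice the is_downloaded check would look for the corresponding daily file on disk rather than consult an in-memory set.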

`load_data_for_dates(dates, satellites=['A', 'B', 'C'])`

Load previously downloaded Swarm data for specific dates.

Parameters

dates : List[dt.date]
    List of dates to load data for.
satellites : List[str], default ['A', 'B', 'C']
    Which satellites to load data for.

Returns

dict
    Dictionary with satellite names as keys and concatenated DataFrames as values.

`load_concatenated_data(satellites=['A', 'B', 'C'])`

Load previously saved concatenated Swarm data.

Parameters

satellites : List[str], default ['A', 'B', 'C']
    Which satellites to load data for.

Returns

dict
    Dictionary with satellite names as keys and DataFrames as values.

`get_data_summary()`

Get summary of available downloaded data.

Returns

pd.DataFrame
    Summary of available data files with metadata.
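As a rough picture of what such a summary could contain, the sketch below scans a data directory and records satellite, date, format, and size for each daily file. The file name pattern and the column names are assumptions made for illustration; the columns actually returned by get_data_summary() are not documented here.

```python
import re
import tempfile
from pathlib import Path
from typing import Dict, List

# Assumed daily file name pattern, e.g. swarm_A_2020-01-01.csv
DAILY_FILE = re.compile(r"^swarm_([ABC])_(\d{4}-\d{2}-\d{2})\.(csv|parquet)$")


def summarize(data_dir: Path) -> List[Dict[str, object]]:
    """List satellite, date, format and size for every daily file under data_dir."""
    rows = []
    for path in sorted(data_dir.rglob("swarm_*")):
        match = DAILY_FILE.match(path.name)
        if path.is_file() and match:
            satellite, date, fmt = match.groups()
            rows.append({"satellite": satellite, "date": date,
                         "format": fmt, "size_bytes": path.stat().st_size})
    return rows


# Demo on a throwaway directory holding two fake daily files
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    for name in ("swarm_A_2020-01-01.csv", "swarm_B_2020-01-01.csv"):
        target = root / name[:7] / "2020" / "01" / name
        target.parent.mkdir(parents=True)
        target.write_text("timestamp,F\n")
    summary = summarize(root)

print(summary)
```

A list of dictionaries like this can be passed directly to `pd.DataFrame` to obtain a tabular summary.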

`cleanup_data(older_than_days=None, quality_threshold='poor')`

Clean up downloaded data files.

Parameters

older_than_days : int, optional
    Remove files older than this many days.
quality_threshold : str, default 'poor'
    Remove files with data quality below this threshold.

Returns

int
    Number of files removed.
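The age-based part of the cleanup can be sketched as follows. This is an illustration under one assumption: file age is taken from the file's modification time, whereas the manager may instead rely on the date in the file name or on its metadata log. The helper remove_older_than is hypothetical.

```python
import os
import tempfile
import time
from pathlib import Path


def remove_older_than(data_dir: Path, older_than_days: int) -> int:
    """Delete files whose modification time is before the cutoff; return the count."""
    cutoff = time.time() - older_than_days * 86400
    removed = 0
    # Materialize the listing first so deletions do not disturb the traversal
    for path in list(data_dir.rglob("*")):
        if path.is_file() and path.stat().st_mtime < cutoff:
            path.unlink()
            removed += 1
    return removed


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    old_file = root / "swarm_A_2020-01-01.csv"
    new_file = root / "swarm_A_2020-01-02.csv"
    old_file.write_text("old")
    new_file.write_text("new")
    # Backdate one file by 40 days so it falls outside a 30-day window
    forty_days_ago = time.time() - 40 * 86400
    os.utime(old_file, (forty_days_ago, forty_days_ago))

    removed_count = remove_older_than(root, older_than_days=30)
    remaining = sorted(p.name for p in root.iterdir())

print(removed_count, remaining)
```

As with cleanup_data, the return value is the number of files removed, which makes it easy to log or assert on the effect of a cleanup run.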

Overview

SwarmDataManager is the MagGeo class for managing Swarm satellite data. It stores downloads on disk so they can be reused, resumes interrupted downloads, and organizes files by satellite and date.

Key Features

Persistent Storage: Download once, use many times
Resume Capability: Continue interrupted downloads
Multiple Formats: Parquet, CSV, and Pickle support
Automatic Organization: Structured directory layout
Data Quality: Built-in quality assessment and filtering
Memory Efficient: Lazy loading and chunked processing

Quick Start

from maggeo import SwarmDataManager
import pandas as pd

# Create manager (the VirES token is passed to the constructor)
manager = SwarmDataManager(
    data_dir="my_swarm_data",
    file_format="parquet",
    token="your_vires_token"
)

# Load GPS trajectory and parse its timestamps
gps_df = pd.read_csv("trajectory.csv")
gps_df['timestamp'] = pd.to_datetime(gps_df['timestamp'])

# Download Swarm data; returns one concatenated DataFrame per satellite
swarm_a, swarm_b, swarm_c = manager.download_for_trajectory(gps_df)

Directory Structure

The manager creates an organized directory structure, shown here for file_format="csv":

my_swarm_data/
├── swarm_A/
│   ├── 2020/
│   │   ├── 01/
│   │   │   ├── swarm_A_2020-01-01.csv
│   │   │   └── swarm_A_2020-01-02.csv
│   │   └── 02/
│   └── concatenated/
│       └── swarm_A_2020-01-01_to_2020-01-31.csv
├── swarm_B/
├── swarm_C/
└── metadata/
    ├── download_log.json
    └── quality_reports.json
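The layout above maps a satellite and a date to a single file path. A minimal sketch of that mapping is given below; the helper daily_file_path is hypothetical and simply mirrors the tree shown, so the manager's own path logic may differ in details.

```python
import datetime as dt
from pathlib import Path


def daily_file_path(data_dir: str, satellite: str, day: dt.date, file_format: str = "csv") -> Path:
    """Build the path of one daily file following the layout shown above (hypothetical helper)."""
    return (
        Path(data_dir)
        / f"swarm_{satellite}"
        / f"{day:%Y}"
        / f"{day:%m}"
        / f"swarm_{satellite}_{day.isoformat()}.{file_format}"
    )


path = daily_file_path("my_swarm_data", "A", dt.date(2020, 1, 1))
print(path.as_posix())  # my_swarm_data/swarm_A/2020/01/swarm_A_2020-01-01.csv
```

Nesting by year and month keeps each directory small even when several years of daily files are stored.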

Advanced Usage

Custom Configuration

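manager = SwarmDataManager(
    data_dir="swarm_data",
    file_format="parquet",
    chunk_size=10,  # number of dates processed per batch
    token="your_vires_token",
    # parallel_download, quality_filter and compression are not in the
    # constructor signature documented above; check your installed
    # version before passing them.
)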

Batch Operations

# Download for multiple trajectories (the VirES token is set on the manager)
trajectories = ["traj1.csv", "traj2.csv", "traj3.csv"]

for traj_file in trajectories:
    gps_df = pd.read_csv(traj_file)
    gps_df['timestamp'] = pd.to_datetime(gps_df['timestamp'])
    manager.download_for_trajectory(gps_df)

# Load all concatenated data at once; returns a dict keyed by satellite.
# To restrict the result to specific dates, use load_data_for_dates().
all_data = manager.load_concatenated_data(satellites=['A', 'B', 'C'])

Quality Control

# Summarize the downloaded files and their metadata
summary = manager.get_data_summary()
print(summary)

# Remove files whose data quality is below the threshold.
# This deletes files from disk and returns how many were removed.
removed = manager.cleanup_data(quality_threshold='poor')
print(f"Files removed: {removed}")

Performance Tips

Optimization Strategies

Use Parquet format for best performance
Enable parallel download for large date ranges
Set appropriate chunk_size based on available memory
Use concatenated files for repeated analysis
Filter by quality to reduce processing time

Error Handling

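The manager raises distinct exceptions for data, network, and storage failures, so each case can be handled separately:

# SwarmDataError, NetworkError and StorageError must be imported from the
# MagGeo module that defines them in your installed version.
try:
    data = manager.download_for_trajectory(gps_df)
except SwarmDataError as e:
    print(f"Swarm data error: {e}")
except NetworkError as e:
    print(f"Network error: {e}")
except StorageError as e:
    print(f"Storage error: {e}")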

Integration with Core Functions

The manager can be used from MagGeo's core functions through the parameter dictionary:

import maggeo

# Use with main annotation function
params = {
    'data_dir': 'gps_data',
    'gpsfilename': 'trajectory.csv',
    'use_swarm_manager': True,
    'swarm_data_dir': 'swarm_data',
    'swarm_manager_format': 'parquet',
    # ... other params
}

result = maggeo.annotate_gps_with_geomag(params)
