SwarmDataManager
Manages Swarm satellite data download, storage, and retrieval operations.
This class provides a high-level interface for working with Swarm data independently from the main MagGeo pipeline.
Functions
__init__(data_dir='swarm_data', file_format='csv', chunk_size=10, token=None)
Initialize SwarmDataManager.
Parameters
- data_dir : str, default "swarm_data". Directory in which downloaded Swarm data is stored.
- file_format : str, default "csv". File format for saved data. Options: "csv", "parquet".
- chunk_size : int, default 10. Number of dates to process in each batch.
- token : str, optional. VirES token used for authentication.
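The chunk_size parameter is described as the number of dates processed per batch. A minimal sketch of that kind of batching follows; iter_date_chunks is a hypothetical helper written for illustration, not part of MagGeo, and the manager's internal batching may differ.

```python
from datetime import date, timedelta
from typing import Iterator, List, Sequence


def iter_date_chunks(dates: Sequence[date], chunk_size: int = 10) -> Iterator[List[date]]:
    """Yield consecutive batches of at most `chunk_size` dates (hypothetical helper)."""
    if chunk_size < 1:
        raise ValueError("chunk_size must be a positive integer")
    for start in range(0, len(dates), chunk_size):
        yield list(dates[start:start + chunk_size])


# 25 consecutive days split into batches of 10, 10, and 5.
days = [date(2020, 1, 1) + timedelta(days=i) for i in range(25)]
batch_sizes = [len(batch) for batch in iter_date_chunks(days, chunk_size=10)]
print(batch_sizes)  # [10, 10, 5]
```

A smaller chunk_size keeps fewer days in flight at once, at the cost of more batches.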
download_for_trajectory(gps_df, save_individual_files=True, save_concatenated=True, resume=True)
Download Swarm data for an entire GPS trajectory.
Parameters
- gps_df : pd.DataFrame. GPS trajectory data with datetime information.
- save_individual_files : bool, default True. Whether to save individual daily files.
- save_concatenated : bool, default True. Whether to save a concatenated file for each satellite.
- resume : bool, default True. Whether to skip files that have already been downloaded.
Returns
tuple : Concatenated DataFrames for satellites A, B, and C.
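Because Swarm files are stored per day, a trajectory download first has to determine which calendar days the GPS fixes cover. The sketch below shows one way to derive that set from plain datetime objects; unique_dates is a hypothetical helper, and whether the manager also fetches neighbouring days for fixes close to midnight is not stated here.

```python
from datetime import date, datetime
from typing import Iterable, List


def unique_dates(timestamps: Iterable[datetime]) -> List[date]:
    """Return the sorted calendar dates touched by the timestamps (hypothetical helper)."""
    return sorted({ts.date() for ts in timestamps})


# Three GPS fixes that span two calendar days.
fixes = [
    datetime(2020, 1, 1, 23, 50),
    datetime(2020, 1, 2, 0, 10),
    datetime(2020, 1, 2, 12, 0),
]
needed = unique_dates(fixes)
print([d.isoformat() for d in needed])  # ['2020-01-01', '2020-01-02']
```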
download_for_dates(dates, save_individual_files=True, save_concatenated=True, resume=True)
Download Swarm data for specific dates.
Parameters
- dates : List[dt.date]. Dates to download data for.
- save_individual_files : bool, default True. Whether to save individual daily files.
- save_concatenated : bool, default True. Whether to save a concatenated file for each satellite.
- resume : bool, default True. Whether to skip files that have already been downloaded.
Returns
tuple : Concatenated DataFrames for satellites A, B, and C.
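download_for_dates expects a list of dt.date objects. One way to build such a list for a continuous period is sketched below; dates_between is a hypothetical helper and includes both endpoints.

```python
import datetime as dt
from typing import List


def dates_between(start: dt.date, end: dt.date) -> List[dt.date]:
    """Return every date from start to end, inclusive (hypothetical helper)."""
    if end < start:
        raise ValueError("end must not be earlier than start")
    return [start + dt.timedelta(days=i) for i in range((end - start).days + 1)]


# Four days that cross a month boundary.
dates = dates_between(dt.date(2020, 1, 30), dt.date(2020, 2, 2))
print(len(dates), dates[0], dates[-1])  # 4 2020-01-30 2020-02-02
```

The resulting list can then be passed as the dates argument, for example manager.download_for_dates(dates).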
load_data_for_dates(dates, satellites=['A', 'B', 'C'])
Load previously downloaded Swarm data for specific dates.
Parameters
- dates : List[dt.date]. Dates to load data for.
- satellites : List[str], default ['A', 'B', 'C']. Satellites to load data for.
Returns
dict : Maps each satellite name to its concatenated DataFrame.
load_concatenated_data(satellites=['A', 'B', 'C'])
Load previously saved concatenated Swarm data.
Parameters
- satellites : List[str], default ['A', 'B', 'C']. Satellites to load data for.
Returns
dict : Maps each satellite name to its DataFrame.
get_data_summary()
Get summary of available downloaded data.
Returns
pd.DataFrame : Summary of the available data files, with metadata.
cleanup_data(older_than_days=None, quality_threshold='poor')
Clean up downloaded data files.
Parameters
- older_than_days : int, optional. Remove files older than this many days.
- quality_threshold : str, default 'poor'. Remove files whose data quality is below this threshold.
Returns
int : Number of files removed.
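To illustrate what an age-based cleanup could look like, the sketch below lists files older than a cutoff without deleting anything. It assumes age is measured by file modification time; whether cleanup_data uses modification time or the date of the data itself is not stated here, and files_older_than is a hypothetical helper.

```python
import os
import tempfile
import time
from pathlib import Path
from typing import List, Optional


def files_older_than(root: Path, older_than_days: int, now: Optional[float] = None) -> List[Path]:
    """List files under root whose modification time is older than the cutoff.

    Hypothetical helper: it only reports candidates and never deletes them.
    """
    now = time.time() if now is None else now
    cutoff = now - older_than_days * 86400  # 86400 seconds per day
    return sorted(p for p in root.rglob("*") if p.is_file() and p.stat().st_mtime < cutoff)


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    old_file = root / "swarm_A_2020-01-01.csv"
    new_file = root / "swarm_A_2020-01-02.csv"
    old_file.write_text("old")
    new_file.write_text("new")
    # Backdate one file by 40 days so that it falls behind a 30-day cutoff.
    forty_days_ago = time.time() - 40 * 86400
    os.utime(old_file, (forty_days_ago, forty_days_ago))
    stale_names = [p.name for p in files_older_than(root, older_than_days=30)]

print(stale_names)  # ['swarm_A_2020-01-01.csv']
```

Listing candidates before removing them makes it easy to review what a cleanup would delete.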
Overview
SwarmDataManager is a class for managing Swarm satellite data in MagGeo. It provides persistent storage, resumable downloads, and a structured on-disk layout for research workflows.
Key Features
- Persistent Storage: Download once, use many times
- Resume Capability: Continue interrupted downloads
- Multiple Formats: Parquet, CSV, and Pickle support
- Automatic Organization: Structured directory layout
- Data Quality: Built-in quality assessment and filtering
- Memory Efficient: Lazy loading and chunked processing
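The resume capability amounts to skipping dates whose files are already on disk. The sketch below shows that check in isolation, assuming the swarm_<satellite>_<date>.<ext> naming used in the directory layout on this page and a single flat folder for brevity; pending_dates is a hypothetical helper, not the manager's implementation.

```python
import datetime as dt
import tempfile
from pathlib import Path
from typing import Iterable, List


def pending_dates(day_dir: Path, satellite: str, dates: Iterable[dt.date], ext: str = "csv") -> List[dt.date]:
    """Return the dates that have no saved file in day_dir yet (hypothetical helper)."""
    return [
        d for d in dates
        if not (day_dir / f"swarm_{satellite}_{d.isoformat()}.{ext}").exists()
    ]


with tempfile.TemporaryDirectory() as tmp:
    store = Path(tmp)
    # Pretend 2020-01-01 was fetched during an earlier, interrupted run.
    (store / "swarm_A_2020-01-01.csv").write_text("already downloaded")
    wanted = [dt.date(2020, 1, 1), dt.date(2020, 1, 2)]
    todo = pending_dates(store, "A", wanted)

print(todo)  # [datetime.date(2020, 1, 2)]
```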
Quick Start
from maggeo import SwarmDataManager
import pandas as pd
# Create the manager; the VirES token is passed here, per the __init__ signature
manager = SwarmDataManager(
    data_dir="my_swarm_data",
    file_format="parquet",
    token="your_vires_token",
)
# Load the GPS trajectory and parse its timestamps
gps_df = pd.read_csv("trajectory.csv")
gps_df['timestamp'] = pd.to_datetime(gps_df['timestamp'])
# Download Swarm data; returns one DataFrame per satellite
swarm_a, swarm_b, swarm_c = manager.download_for_trajectory(gps_df)
Directory Structure
The manager creates an organized directory structure:
my_swarm_data/
├── swarm_A/
│   ├── 2020/
│   │   ├── 01/
│   │   │   ├── swarm_A_2020-01-01.csv
│   │   │   └── swarm_A_2020-01-02.csv
│   │   └── 02/
│   └── concatenated/
│       └── swarm_A_2020-01-01_to_2020-01-31.csv
├── swarm_B/
├── swarm_C/
└── metadata/
    ├── download_log.json
    └── quality_reports.json
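The layout above maps each satellite and date to a predictable path. A small sketch of that mapping follows; daily_file_path is a hypothetical helper that mirrors the tree shown here, and the manager's own path logic may differ.

```python
import datetime as dt
from pathlib import Path


def daily_file_path(root: Path, satellite: str, day: dt.date, ext: str = "csv") -> Path:
    """Build <root>/swarm_<sat>/<YYYY>/<MM>/swarm_<sat>_<YYYY-MM-DD>.<ext> (hypothetical helper)."""
    return (
        root
        / f"swarm_{satellite}"
        / f"{day:%Y}"
        / f"{day:%m}"
        / f"swarm_{satellite}_{day.isoformat()}.{ext}"
    )


path = daily_file_path(Path("my_swarm_data"), "A", dt.date(2020, 1, 2))
print(path.as_posix())  # my_swarm_data/swarm_A/2020/01/swarm_A_2020-01-02.csv
```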
Advanced Usage
Custom Configuration
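manager = SwarmDataManager(
    data_dir="swarm_data",
    file_format="parquet",
    chunk_size=30,             # number of dates processed per batch
    token="your_vires_token",  # VirES authentication token
    # parallel_download, quality_filter, and compression are not in the
    # __init__ signature documented above; confirm that your version
    # accepts them before passing them.
)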
Batch Operations
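# Download for multiple trajectories (the token is set when the manager is created)
trajectories = ["traj1.csv", "traj2.csv", "traj3.csv"]
for traj_file in trajectories:
    gps_df = pd.read_csv(traj_file)
    gps_df['timestamp'] = pd.to_datetime(gps_df['timestamp'])
    manager.download_for_trajectory(gps_df)
# Load all concatenated data at once; returns a dict keyed by satellite name
all_data = manager.load_concatenated_data(satellites=['A', 'B', 'C'])
# start_date and end_date are not in the documented signature; to restrict
# the date range, pass explicit dates to load_data_for_dates() instead.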
Quality Control
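# NOTE: get_quality_report() and the quality_threshold argument are not listed
# in the API reference above; check that your MagGeo version provides them.
# Get quality report
quality_report = manager.get_quality_report('A')
print(f"Data coverage: {quality_report['coverage']:.2%}")
print(f"Missing points: {quality_report['missing_count']}")
# Filter by quality
high_quality_data = manager.load_concatenated_data(
    satellites=['A'],
    quality_threshold=0.9
)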
Performance Tips
Optimization Strategies
- Use Parquet format for smaller files and faster loading than CSV
- Enable parallel download for large date ranges
- Set chunk_size (the number of dates per batch) to suit available memory
- Use concatenated files for repeated analysis
- Filter by quality to reduce processing time
Error Handling
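Download calls can raise separate exception types for data, network, and storage failures:
# SwarmDataError, NetworkError, and StorageError must be imported first;
# their import path is not given here.
try:
    data = manager.download_for_trajectory(gps_df)
except SwarmDataError as e:
    print(f"Swarm data error: {e}")
except NetworkError as e:
    print(f"Network error: {e}")
except StorageError as e:
    print(f"Storage error: {e}")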
Integration with Core Functions
The manager can be enabled from MagGeo's main annotation function through its parameter dictionary:
import maggeo
# Use with main annotation function
params = {
    'data_dir': 'gps_data',
    'gpsfilename': 'trajectory.csv',
    'use_swarm_manager': True,
    'swarm_data_dir': 'swarm_data',
    'swarm_manager_format': 'parquet',
    # ... other params
}
result = maggeo.annotate_gps_with_geomag(params)