This is the multi-page printable view of this section. Click here to print.

Return to the regular view of this page.

Blog

NWB Compression Recommendations for DANDI Archive
Unraveling Anesthesia's Effect on Brain Activity
Human ex-vivo microscopy workshop/hackathon
External Links in NWB and DANDI
DANDI User Workshop
DANDI: A data archive and collaboration space for neurophysiology

This is the blog section.

Files in these directories will be listed in reverse chronological order.

NWB Compression Recommendations for DANDI Archive

An analysis of HDF5 and Zarr compression options for NWB files submitted to the DANDI Archive, providing recommendations based on file size, read speed, and write speed.

By Ben Dichter | Monday, November 03, 2025

When submitting large datasets to the DANDI Archive, it’s crucial to consider data compression options that can substantially reduce file sizes. Smaller files reduce the storage burden on the DANDI Archive and make datasets more convenient to download for users. Neurodata Without Borders (NWB) now supports two file format backends: HDF5 and Zarr. Both formats have built-in capabilities for chunking and compression that can break large datasets into smaller pieces and apply lossless compression to each chunk. This approach reduces file size without altering the dataset values.

DANDI recommends compressing large datasets, but the variety of algorithms and settings available can be overwhelming for NWB users unfamiliar with data engineering. This post aims to answer several common questions:

By how much can the volume of my data be reduced?
What compression options are available and how can I access them?
Which compression algorithms and settings are optimal for my use case?

The answers depend on specific use cases, but our analysis provides general guidelines to help you make informed decisions based on three key metrics: file size, read speed, and write speed. You may also need to consider accessibility - whether the HDF5 or Zarr library can read your file out of the box or requires installation of a dynamic filter.

When considering a compression algorithm, you should consider the following trade-offs:

File Size: Higher compression levels typically yield smaller files but may require more time to compress and decompress.
Read Speed: Some algorithms prioritize fast decompression, which is beneficial for users frequently accessing the data.
Write Speed: If you frequently write or update data, faster compression algorithms may be preferable, even if they result in larger file sizes.
Accessibility: Some compression algorithms are built into HDF5 and Zarr libraries, while others require additional installations. Choosing widely supported algorithms can enhance data accessibility.

Our Testing Approach

We conducted an evaluation of several popular compression algorithms in HDF5 using the h5plugin library, which simplifies the installation process for various compression algorithms and makes them available to h5py. For Python users, we highly recommend this package.

For our test data, we used action potential recordings from a Neuropixel probe (SpikeGLX acquisition system) from DANDI:000053, consisting of high-pass voltage data prior to spike-sorting. This represents a common use case for NWB files. While this analysis focuses on HDF5, similar compression principles apply to Zarr backends as well. We relied on h5py to automatically determine chunk shapes, with shuffle turned off.

Understanding Compression Algorithms

Before diving into the results, it’s helpful to understand the different compression algorithms tested and how they relate to each other:

Core Algorithms

gzip: A widely-used compression algorithm based on the DEFLATE method. It’s built directly into HDF5, making it the most accessible option (no additional libraries needed). Gzip prioritizes compression ratio over speed and has been a standard for decades.
zstd (Zstandard): A modern compression algorithm developed by Facebook that offers an excellent balance between compression ratio and speed. It provides adjustable compression levels (higher levels = better compression but slower) and has become increasingly popular for scientific data.
lz4: Designed primarily for speed, LZ4 prioritizes extremely fast compression and decompression at the cost of somewhat lower compression ratios. It’s ideal when read/write performance is the primary concern.

Blosc Meta-Compressor

Blosc is not a compression algorithm itself, but rather a meta-compressor that adds additional optimizations on top of existing compression algorithms. Blosc splits data into smaller blocks and applies multi-threading, which can significantly improve performance for large arrays. When you see “blosc” in our results:

blosc zstd: Blosc wrapper around the Zstandard algorithm
blosc lz4: Blosc wrapper around the LZ4 algorithm
blosc zlib: Blosc wrapper around the zlib algorithm (similar to gzip)
blosc blosclz: Blosc’s own compression codec
blosc lz4hc: Blosc wrapper around LZ4 in “high compression” mode

The blosc-wrapped versions often show different performance characteristics than their standalone counterparts due to Blosc’s blocking and multi-threading optimizations, which can be particularly effective for numerical arrays like neurophysiology data.

Additional Resources

For those interested in learning more about compression principles and their application to scientific data:

Data Compression Explained - Comprehensive introduction to data compression fundamentals
Blosc Documentation - In-depth explanation of the Blosc meta-compressor
HDF5 Filters - Technical details on how HDF5 implements compression filters

Results and Recommendations

Compression algorithm comparison Figure 1: Comparison of compression algorithms. The black circle shows data that is chunked but not compressed. Each color represents a different algorithm, with lightness/darkness indicating compression level.

Key Findings:

Best Overall Performance: zstd level 3
- Provides excellent compression ratio (about 45% reduction)
- Maintains good read and write speeds at moderate compression levels
- At higher levels, write time increases dramatically with minimal additional space savings
Fastest Read Performance: blosc lz4
- If optimizing primarily for read speed, blosc lz4 is about twice as fast as zstd
- For blosc lz4, level 3 or 5 offers a good balance between size and speed
Alternative Option: gzip
- Most accessible because it is built into HDF5 (no additional configuration needed)
- Read time is about 1.8x slower than zstd
- At moderate levels (e.g., the default level 4), offers similar file size and write time to zstd
- Good option for users unable to configure custom dynamic filters
Best Compression Ratio: blosc zstd
- Achieves slightly better compression ratio and slightly worse read time than standard zstd
- Significantly longer write times at higher compression levels

Detailed Results

For those interested in the exact benchmark numbers, the table below shows all tested compression configurations with their corresponding write times, file sizes, and read times:

name	write time (s)	file size (MB)	read time (s)
no_filter None	0.170	768.009	0.098
zstd (0,)	1.857	387.450	1.005
zstd (3,)	3.222	354.957	1.289
zstd (6,)	8.464	347.277	1.339
zstd (9,)	14.810	345.556	1.278
lz4 (0,)	1.605	572.530	1.021
lz4 (1000.0,)	1.425	696.795	0.916
lz4 (1000000.0,)	1.589	572.530	1.107
gzip 1	4.688	359.142	2.072
gzip 4	8.018	346.226	2.099
gzip 9	60.077	352.872	2.184
lzf None	1.607	527.347	0.982
blosc blosclz 1	0.650	768.438	0.341
blosc blosclz 3	1.183	684.614	0.653
blosc blosclz 5	1.175	684.614	0.604
blosc blosclz 7	1.846	617.389	0.819
blosc blosclz 9	2.432	574.626	1.004
blosc lz4 1	1.442	650.519	0.553
blosc lz4 3	1.635	614.405	0.591
blosc lz4 5	1.721	598.747	0.589
blosc lz4 7	1.839	581.546	0.619
blosc lz4 9	1.959	567.020	0.621
blosc lz4hc 1	3.612	536.971	0.576
blosc lz4hc 3	7.567	487.303	0.565
blosc lz4hc 5	11.790	470.780	0.533
blosc lz4hc 7	13.138	469.490	0.560
blosc lz4hc 9	13.091	469.482	0.552
blosc zlib 1	5.289	371.400	2.118
blosc zlib 3	7.699	356.136	2.012
blosc zlib 5	21.409	364.851	2.321
blosc zlib 7	45.256	361.311	2.158
blosc zlib 9	50.259	360.995	2.147
blosc zstd 1	1.839	394.557	0.960
blosc zstd 3	5.508	344.206	1.355
blosc zstd 5	14.713	345.752	1.215
blosc zstd 7	45.198	333.719	1.328
blosc zstd 9	108.811	324.897	1.509

Conclusion

By applying appropriate compression settings to your NWB files before uploading to DANDI, you can significantly reduce storage requirements and improve download experiences for users of your datasets. For most neurophysiology datasets using the HDF5 backend, we recommend zstd level 3 or 4 as a good default that balances compression ratio with performance. However, if you have specific requirements for read speed or file size, you might consider blosc lz4 or blosc zstd respectively.

While this analysis focused on HDF5, Zarr offers similar compression options and benefits. Zarr natively supports blosc, zstd, and gzip compression algorithms with comparable performance characteristics. This is an active area of research, and there are newer algorithms that are designed specifically for electrophysiology and available in Zarr. See Compression strategies for large-scale electrophysiology data for a detailed discussion of these algorithms.

Remember that optimal compression settings may vary depending on your specific data characteristics and usage patterns. We encourage you to experiment with different options using the provided code example to find the best configuration for your datasets.

Code

import os
import hdf5plugin
import h5py
import time
from tqdm import tqdm
import matplotlib.pyplot as plt
from matplotlib.lines import Line2D
import json
import pandas as pd
import seaborn as sns


# set up parameters for tests
filters = {
    'no_filter': dict(
        filter_class={'chunks':(10000, 384)},
        compression_opts_list=[None],
    ),
    'zstd': dict(
        filter_class=hdf5plugin.Zstd(),
        compression_opts_list=[
            (0,),
            (3,),
            (6,),
            (9,),

        ],
    ),
    'lz4': dict(
        filter_class=hdf5plugin.LZ4(),
        compression_opts_list=[
            (0,),
            (1e3,),
            (1e6,),
        ],
    ),
    'gzip': dict(
        filter_class=dict(compression='gzip'),
        compression_opts_list=[
            1, 4, 9
        ]
    ),
    'lzf': dict(
        filter_class=dict(compression='lzf'),
        compression_opts_list=[None,],
    )
}


blosc_filters = ["blosclz", "lz4", "lz4hc", "zlib", "zstd"]

results = []


def test_filter(run_name, **kwargs):
    """Get write time, read time, and file size for a run."""

    fname = f"{run_name}.h5"


    with h5py.File(fname, mode="w") as file:
        start = time.time()
        file.create_dataset(**kwargs)
        write_time = time.time() - start

    file_size = os.stat(fname).st_size / 1e6

    with h5py.File(fname, mode="r") as file:
        start = time.time()
        data = file['data'][:]
        read_time = time.time() - start

    return {
        'name': run_name,
        'write time (s)': write_time,
        'file size (MB)': file_size,
        'read time (s)': read_time,
    }

def create_mini_palette(color, filter_, compression_opts_list):
    """Create palette of different shades of the same color for a given filter."""

    mini_pallete = sns.light_palette(
    color=color,
    n_colors=len(compression_opts_list)+2)[2:]
    for comp_opts, this_color in zip(compression_opts_list, mini_pallete):
        palette.update({f"{filter_} {comp_opts}": this_color})
    return palette


import lindi

# Load https://api.dandiarchive.org/api/assets/ac23d031-019a-4c07-854d-20a3647f3097/download/
local_cache = lindi.LocalCache()

file = lindi.LindiH5pyFile.from_lindi_file("https://lindi.neurosift.org/dandi/dandisets/000053/assets/ac23d031-019a-4c07-854d-20a3647f3097/nwb.lindi.json", local_cache=local_cache)

data = file['acquisition/ElectricalSeries/data'][:1000000]
print(f"{data.nbytes / 1e9} GB", flush=True)


## run filters

#test non-blosc filters
for filter_name, filter_dict in tqdm(list(filters.items()), desc='non-blosc'):
    args = dict(
        name="data",
        data=data,
        **filter_dict["filter_class"]
    )
    for compression_opts in filter_dict["compression_opts_list"]:
        args.update(compression_opts=compression_opts)
        run_name = f"{filter_name} {compression_opts}"
        results.append(test_filter(run_name, **args))

# test blosc filters
for filter_name in tqdm(blosc_filters, desc='blosc'):
    for level in range(1, 10, 2):
        args=dict(
            name="data",
            data=data,
            **hdf5plugin.Blosc(cname=filter_name, clevel=level, shuffle=0)
        )
        run_name = f"blosc {filter_name} {level}"
        try:
            results.append(test_filter(run_name, **args))
        except Exception as e:
            print(f"Error occurred for {run_name}: {e}")


df = pd.DataFrame(results)

# Calculate compression ratio (original size / compressed size)
original_size_mb = data.nbytes / 1e6
df['compression ratio'] = original_size_mb / df['file size (MB)']

df.to_parquet("results.parquet")

# uncomment for previous results
# df = pd.read_parquet("results.parquet")

# Save results as HTML table
df.to_html("results.html", index=False, float_format=lambda x: f'{x:.3f}')

# visualization - construct palette

# for non-blosc
counter = 0
base_palette = sns.color_palette(n_colors=11)
palette = {'no_filter None': 'k'}
for filter_, filter_dict in list(filters.items())[1:]:
    color = base_palette[counter]
    if filter_ not in ('no_filter None',):
        palette.update(create_mini_palette(color, filter_, filter_dict["compression_opts_list"]))
    counter += 1


# for blosc
for filter_name in blosc_filters:
    counter += 1
    color = base_palette[counter]
    palette.update(create_mini_palette(color, "blosc " + filter_name, range(1, 10, 2)))

n_filters = sum(len(x["compression_opts_list"]) for x in filters.values())
n_blosc_filters = 7*len(blosc_filters)
markers = ['o'] * n_filters + ['x'] * n_blosc_filters

# Create PairGrid without diagonals
g = sns.PairGrid(data=df, hue="name", palette=palette, corner=True, height=4, diag_sharey=False)

# Map scatter plots to lower triangle with custom markers
def scatter_with_markers(x, y, **kwargs):
    """Custom scatter function that uses different markers for blosc vs non-blosc."""
    data = kwargs.pop('data')
    hue = kwargs.pop('hue', None)
    
    for name in data[hue].unique():
        subset = data[data[hue] == name]
        marker = 'x' if name.startswith('blosc') else 'o'
        color = palette.get(name, 'gray')
        plt.scatter(subset[x.name], subset[y.name], marker=marker, c=[color], 
                   s=30, alpha=0.6, label=name)

# Use custom function for lower triangle
for i in range(len(g.axes)):
    for j in range(len(g.axes[i])):
        if i > j:  # Lower triangle only - plot data
            ax = g.axes[i, j]
            x_var = g.x_vars[j]
            y_var = g.y_vars[i]
            
            for name in df['name'].unique():
                subset = df[df['name'] == name]
                marker = 'X' if name.startswith('blosc') else 'o'
                color = palette.get(name, 'gray')
                ax.scatter(subset[x_var], subset[y_var], marker=marker, c=[color],
                          s=30, alpha=0.6, label=name)
            
            ax.set_xlabel(x_var)
            ax.set_ylabel(y_var)
        elif i == j:  # Diagonal - remove these axes
            g.axes[i, j].remove()

# Add legend with custom markers
handles = []
labels = []
for name in df['name'].unique():
    marker = 'X' if name.startswith('blosc') else 'o'
    color = palette.get(name, 'gray')
    handles.append(Line2D([0], [0], marker=marker, color='w', 
                         markerfacecolor=color, markersize=8, alpha=0.6))
    labels.append(name)

g.fig.legend(handles, labels, loc='center right', bbox_to_anchor=(0, 0.5), ncol=2)

plt.savefig("eval_compressions.pdf", bbox_inches="tight")
plt.savefig("eval_compressions.png", bbox_inches="tight", dpi=300)

Unraveling Anesthesia's Effect on Brain Activity

Have you ever wondered what happens to your brain when you’re under anesthesia? What neural changes coincide with that dramatic loss of consciousness? In this post, we’ll explore how two different research teams investigated this fascinating question using the same data from DANDI Archive’s Dandiset #000458.

By Paul Adkisson (@pauladkisson) | Sunday, November 10, 2024

Data Reuse Highlight:

This blog entry is part of an ongoing series that showcases examples of data reuse and reanalysis using publicly available data on DANDI.

The Original Study: Peering into the Anesthetized Brain

In June 2023, Dr. Leslie Claar and colleagues from the Allen Institute published their groundbreaking findings in the journal eLife in the article, “Cortico-thalamo-cortical interactions modulate electrically evoked EEG responses in mice”. Their approach was comprehensive: they stimulated mouse cortex while simultaneously recording with electroencephelography (EEG) and neuropixels probes, comparing brain activity during wakefulness and under isoflurane anesthesia.

In awake mice, Claar’s team observed a robust event-related potential (ERP) when they delivered pulsatile electrical stimulation to deep layers of the motor cortex (Figure 1A left). This ERP coincided with a distinct pattern of neural spiking, both locally in the motor cortex and in distant sensorimotor-related thalamic nuclei (Figure 1B left). The response followed a three-act play: initial excitation within 25 milliseconds, a period of quiescence for about 125 milliseconds, and finally, a strong rebound excitation before returning to baseline. When the mice were under anesthesia, the story changed dramatically. The ERP’s amplitude and complexity plummeted in response to the same stimulation (Figure 1A right). While the initial excitation persisted in the motor cortex, it vanished in the thalamic nuclei (Figure 1B right). The rebound excitation, along with its corresponding ERP component at around 180 milliseconds, was significantly weakened.

Based on these observations, Claar and colleagues proposed that interactions between the cortex and thalamus are responsible for the rebound excitation phase of the stimulus response. This rebound excitation, in turn, drives the second component of the ERP – a crucial signal of wakefulness.

After their experiments, Claar and her team published their data on DANDI, ensuring that it would remain publicly accessible for any future researchers who wanted to use it. The dataset and manuscript were prepared and published concurrently, which ensured that the dataset included all the paper-relevant data and metadata and curious reviewers could inspect the data directly if they so wished.

Figure 1: Brain state modulates the ERP via cortico-thalamo-cortical interactions. (A) Butterfly plot of ERPs during non-running (quiet wakefulness), running (active wakefulness), and isoflurane-anesthetized states. (B) Normalized firing rate, reported as a z-score of the average, pre-stimulus firing rate, of all RS neurons recorded by the Neuropixels probes targeting the stimulated cortex (MO) and SM-TH. Reproduced from “Cortico-thalamo-cortical interactions modulate electrically evoked EEG responses in mice”.

A New Perspective: Zooming in on Synaptic Potentials

Just a few months later, in November 2023, Drs. Richard Burman, Paul Brodersen and colleagues at the University of Oxford wrote a paper that included reanalysis of this data. Their findings were published in the journal Neuron with the article, “Active cortical networks promote shunting fast synaptic inhibition in vivo”. They also investigated the effects of anesthesia on brain activity, but narrowed their focus to the differential function of a specific receptor subtype: GABA_AR.

Burman and Brodersen’s team used a technique called in vivo gramicidin perforated patch-clamp recording, combined with optogenetic activation of GABAergic synaptic inputs. They compared the GABA_A receptor equilibrium potential (E_{GABA_AR}) in mice during wakefulness and under urethane anesthesia.

Their findings were striking. In anesthetized mice, GABA_A receptor responses were significantly hyperpolarizing, with an average equilibrium potential of -80.8mV. In awake mice, however, these responses could be either hyperpolarizing or depolarizing, averaging at -63.9mV – a state known as the “shunting” regime.

Burman et al. hypothesized that the higher network activity in the awake state leads to a higher concentration of intracellular chloride, which in turn depolarizes the GABA_A receptor equilibrium potential. Anesthesia, by broadly suppressing neural activity, indirectly hyperpolarizes E_{GABA_AR}. They confirmed this hypothesis by locally injecting an AMPA receptor antagonist (NBQX) which suppressed network activity, and hyperpolarized E_{GABA_AR} to -78.0mV.

The Oxford researchers also hypothesized that changes in E_{GABA_AR} would lead to changes in network activity, resulting in a bi-directional relationship between the two quantities.

To test this hypothesis, Burman and Brodersen’s team turned to the data from Dandiset #000458 – the very same dataset used by Claar et al. By reanalyzing this data and manipulating a computational network model, they found that setting the E_{GABA_AR} parameter to the shunting regime (-60mV) reproduced spiking responses similar to those seen in awake mice (Figure 2 B&F left). Conversely, setting it to the hyperpolarizing regime (-80mV) mimicked the responses seen in anesthetized mice (Figure 2 B&F right). This data provides evidence for the hypothesized bidirectional relationship.

Figure 2: Shunting inhibition promotes local network desynchronization and response flexibility. (A) High-density Neuropixels (NPXs) recordings were used to compare spiking activity in the same cortical neurons under awake and anesthetized conditions. (B) Example raster plots show the spiking activity of a population of neurons in somatosensory cortex (SS). Spiking activity is shown across three stimulation trials for each condition. (C) Neuronal synchrony on a trial-to-trial basis (top) and the entropy of the peri-stimulus histogram (bottom) were calculated for each of the 662 recorded neurons under awake and anesthetized conditions (17 probe recordings in 16 mice). Each dot corresponds to a single neuron and the dashed line indicates the line of equality. (D) Neurons in the awake condition exhibited lower synchrony (top; Aw.: 0.325 ± 0.005 vs. An.: 0.393 ± 0.007, p < 0.001, paired t test) and higher entropy (bottom; Aw.: 1.347 ± 0.018 nats vs. An.: 1.111 ± 0.020 nats, p < 0.001, paired t test). (E) Schematic of network model consisting of interconnected excitatory pyramidal neurons (Pyr.) and inhibitory interneurons (IN). EGABAAR in the pyramidal neurons was adjusted relative to the RMP to create two different conditions: a shunting (Shunt.) and a hyperpolarizing (Hyperpol.) EGABAAR condition. Spiking activity was evoked by delivering brief depolarizing currents (input) of varying amplitudes to each neuron in the network. (F) Raster plots for the same population of pyramidal neurons (n = 50) in the shunting (left) and hyperpolarizing (right) EGABAAR conditions. (G) Synchrony (top) and entropy (bottom) for pyramidal neurons in the shunting and hyperpolarizing EGABAAR conditions (n = 1,000 randomly selected). Dashed line indicates the line of equality. (H) Neurons in the shunting EGABAAR condition exhibited lower synchrony (Shunt: 0.784 ± 0.001 vs. Hyperpol: 0.881 ± 0.001, p < 0.001, paired t test) and higher entropy (Shunt: 0.451 ± 0.002 nats vs. Hyperpol: 0.197 ± 0.002 nats, p < 0.001, paired t test) (n = 16,000 pyramidal neurons from the model). ∗∗∗p < 0.001. MO, motor cortex; Stim., electrical stimulation; TH, thalamus. Reproduced from “Active cortical networks promote shunting fast synaptic inhibition in vivo”.

Bridging Scales: From Synapses to Networks

Together, these two studies provide a rich, multi-level account of what happens to brain activity under anesthesia. They span from macroscale brain signals recorded by EEG all the way down to microscale synaptic responses from a single receptor type. The population spiking data from Dandiset #000458 serves as a crucial bridge between these levels, allowing Burman et al. to explicitly connect their work to the broader field of anesthesia research.

However, it’s important to note some key differences between the two studies. Claar et al. used isoflurane anesthesia and stimulated layers 5/6 of the cortex, while Burman et al. used urethane anesthesia and recorded from layer 2/3. Additionally, while Claar et al. found the most dramatic differences between waking and anesthetized states around 180 ms after stimulation, Burman et al. focused their reanalysis on the initial spiking response within 15 ms of stimulation.

These differences highlight the potential for further reanalysis of Dandiset #000458. Future studies could explore more similar experimental conditions, conduct a more thorough treatment of relevant response time periods, or investigate different aspects of anesthesia’s effects altogether.

Engaging with the Open Science Community

For those inspired by the work of Claar, Burman and Brodersen, and their teams, the journey doesn’t end here. We encourage readers from all backgrounds to take advantage of this open science platform by publishing and reanalyzing data on DANDI. We are confident that it will be a smooth and rewarding experience, but don’t just take our word for it – listen to what Dr. Leslie Claar and Dr. Paul Brodersen have to say,

“After organizing our dataset into NWB format, validating and uploading our files was straightforward using a few simple commands from the DANDI Python Client. The tools are well-documented, which made the process of uploading and publishing quite smooth. Within a few months of publishing on DANDI, others were already interested in using our dataset.” – Dr. Leslie Claar

“The DANDI archive is a fantastic resource with great tooling and exhaustive documentation. Its command line interface makes data retrieval and synchronization as simple as possible (but not simpler!). The standardized and well-documented formats for electrophysiological and neuroimaging data ensure that anyone can quickly find their way around a new archive without requiring extensive feedback from the authors that originally collected the data. These features make the DANDI archive a treasure trove of useful and reusable data, unlike the unfortunately still commonplace unstructured data dumps on many other platforms or – worst of all – university servers.” – Dr. Paul Brodersen and Professor Colin Akerman

To get you started on your own reanalysis project, we’ve prepared a Jupyter notebook demonstrating some basic reanalysis of Dandiset #000458. This tool will help you dive right into the data, guiding you through the initial steps of exploration and analysis. We have also prepared a Neurosift walkthrough which demonstrates how to use Neurosift to quickly and easily explore the NWB files in this dataset without even leaving the browser.

By engaging with this open science platform, you become part of a larger community of researchers who are building on each other’s work and driving the field forward. In the world of open science, collaboration and curiosity drive discovery—so why not see where the data takes you?

Human ex-vivo microscopy workshop/hackathon

The goal of this set of events is to introduce the BICCN human ex vivo MRI and microscopy datasets, involve the relevant communities to develop and explore scientific ideas, create tooling around these and future datasets to make it easy, and plan towards development of an online, digital, cellular atlas and knowledge base of the human brain.

By DANDI team | Tuesday, March 08, 2022

Date/Time: April 21 - 22, 2022. 9am - 6pm ET

Online: Link will be sent to registered participants.

If you are interested in participating please fill in this form to be sent announcements: Register here

Pre-meeting presentations

YouTube Playlist

The Dandisets
- Dandiset 000026 video: Link to dataset
- Dandiset 000108 video: Link to dataset
The Standards
- BIDS-Microscopy video
  - BIDS Microscopy Specification
  - Slides
- OME-NGFF video
  - NGFF Specification
  - Slides
The Tools
- Neuroglancer video: Link to Github repository
- NeuroTrAle video

Agenda for the hackathon

The events will run 9 - 6 ET to accommodate multiple time zones. Participants are welcome to join at any time.

Day 1: April 21, 2022 (9 - 6 ET)

DANDI+Microscopy Starter Kit: We will generate a set Starter Kit that makes it easy for users to navigate these and other Dandisets, including easier visualization of individual files.
Dandiset Abstraction Layer:
- Create a set of scripts to make it easy to programmatically access these Dandisets
- Be able to view and compute on them.

Unconference session (11.30 - 12.30 ET)

Day 2: April 22, 2022 (9 - 6 ET)

Visualization/Annotation:
- Develop a visualization of a single slab or a set of related chunks
- Develop the steps to annotate using NeuroTrAle
Registration:
- Develop scripts to register lightsheet data to MRI at different scales

Unconference session (11.30 - 12.30 ET)

Please feel free to comment on this community doc.

External Links in NWB and DANDI

External Files: video/audio files that are part of the experiment but are not stored in NWB as HDF5 format.

By Saksham Sharda | Thursday, March 03, 2022

The need for external files

Neurophysiology experiments often include natural videos (such as behaving animals), which need to be stored with the neurophysiological recordings in order to ensure maximal reusability of the data. These videos are commonly stored with lossy compression (e.g. h264 in an .mp4 file), which allows them to achieve very high compression ratios. It is possible to read these videos frame-by-frame, and store them in HDF5, but since HDF5 is not able to access popular video codecs like h264, the volume of the video in the NWB file is much larger (even when using the available compression algorithms like GZIP). NWB has an option to avoid storing these altogether by linking to these external video files using a relative path to that file on disk. This relative path is stored in the ImageSeries neurodata_type storing it as an attribute of a string dtype. We also need to publish these video linked NWB files in an archive (e.g. in DANDI). For DANDI, which renames and reorganizes the the NWB files, this requires not only uploading the video file on the archive but also changing the path attribute of the ImageSeries to reflect the new file names.

To implement this, we have created a formal naming convention for these video files relative to the NWB files’ path. In addition, these video files are also placed in a specific folder structure relative to the new location of the NWB file during the dandi organize call.

Internally the steps are as follows:

Organizing and renaming the video files with one of move/copy/symlink/hardlink in the new folder structure.
Updating the value of the external_file attribute in the NWB files.
Uploading on DANDI.

Note: this solution is specifically for natural videos like those of behaving animals. There are other types of image sequences like image stacks from optical physiology, which do not use codecs like h264; these types of videos can be copied into an HDF5 file.

Example re-organization

Original folder organization

├── nwbfiles
│   ├── test1_0_0.nwb
│   └── test1_1_1.nwb
└── video_files
    ├── test1_0.avi
    ├── test1_1.avi
    ├── test2_0.avi
    └── test2_1.avi

With the path attribute as: image_series.external_files=["../video_files/test1_0.avi", "../video_files/test1_1.avi"]

After `dandi organize`

The renaming pattern is as follows /<nwbfile_name>/{ImageSeries UUID}_external_file_{number}.mp4. This UUID is that assigned to the ImageSeries datatype when its created. Thus its possible to lookup a video file linked to an NWB file and vice versa.

└── dandi_organized
    ├── sub-mouse0
    │   ├── sub-mouse0_ses-sessionid0_image
    │   │   ├── 933f8cf6-9e4b-405f-8cad-cc031d1fafc9_external_file_0.avi
    │   │   └── 933f8cf6-9e4b-405f-8cad-cc031d1fafc9_external_file_1.avi
    │   └── sub-mouse0_ses-sessionid0_image.nwb
    └── sub-mouse1
        ├── sub-mouse1_ses-sessionid1_image
        │   ├── 03137112-9d42-46b6-9046-45bc9aa7eb5e_external_file_0.avi
        │   └── 03137112-9d42-46b6-9046-45bc9aa7eb5e_external_file_1.avi
        └── sub-mouse1_ses-sessionid1_image.nwb

With the renamed path attribute as

image_series.external_files=
["sub-mouse0_ses-sessionid0_image/933f8cf6-9e4b-405f-8cad-cc031d1fafc9_external_file_0.avi",
"sub-mouse0_ses-sessionid0_image/933f8cf6-9e4b-405f-8cad-cc031d1fafc9_external_file_1.avi"]

Code Walkthrough

Register dataset on DANDI (staging)

cd dandi_organized
dandi download "https://gui-staging.dandiarchive.org/#/dandiset/101391/draft"

Organize

cd dandi_organized
dandi organize -f "copy" --update-external-file-paths --media-files-mode "copy" "/nwbfiles"

–modify-external-file-fields option is a flag.

If active, the organise operation modifies the external_file field of an ImageSeries that holds the local location of an associated video file. It changes the value to the new name as per the convention above. If no external_file field found in all nwb files, but this option is active, then it logs a warning. If any NWB file’s ImageSeries has a external_file, but this option is not specified, then it raises a ValueError to avoid breaking the link.

–media-files-mode can be any of copy/move/symlink/hardlink.

This can only be specified if the –modify-external-file-fields flag is True. This is an optional argument, if not specified it defaults to “symlink”: an efficient way to deal with possibly large video files.

Validate

dandi validate

Upload

dandi upload -i dandi-staging "/dandi_organized"

Example dandiset here

download

This dataset can then be downloaded using:

mkdir dandi_download
cd dandi_download
dandi download "https://gui-staging.dandiarchive.org/#/dandiset/101391/draft"

The folder will contain all the video files along with the dandi metadata .yml and .nwb files.

DANDI User Workshop

Virtual hands-on workshop: Nov 1, 2021

By DANDI team | Wednesday, October 13, 2021

Click to view a recording of the workshop

Registration is closed

DANDI: A data archive and collaboration space for neurophysiology

By DANDI team | Sunday, September 12, 2021

DANDI is a US BRAIN Initiative supported data archive for publishing and sharing neurophysiology data including intracellular and extracellular electrophysiology, optophysiology, and behavioral time-series, and images from immunostaining experiments. For example, data from experimental techniques like patch clamps, silicon probes, and calcium imaging can be published on DANDI. So can data from lightsheet microscopy experiments when combined with associated MRI data.

DANDI allows scientists to publish “dandisets,” collections of data from neuroscience experiments that are often associated with specific papers or projects. DANDI is built to scale with the growing data engineering needs of the community and can easily receive and publish TBs of data for free supported by the AWS public dataset program. It also uses best practices in metadata curation using the DANDI schema and allows users to mint DOIs using Datacite for publishing dandisets.

For neurophysiology data, DANDI supports the Neurodata Without Borders (NWB) standard. The NWB standard ensures that data is shared with the necessary metadata for reanalysis, allows for analyses that aggregate data across dandisets, provides a common platform for building analysis and visualization tools for published data, and allows DANDI to easily extract metadata to provide powerful detailed search features. For MRI and related data, DANDI has been leveraging the BIDS standards and working with other groups to extend support to microscopy and electrophysiology.

With DANDI Hub, DANDI is more than just an archive, but in fact an entire cloud-based collaborative platform, available for free to all DANDI users! DANDI Hub is a JupyterHub with access to high performance multi-core compute nodes and high-bandwidth access to all data hosted on DANDI. DANDI Hub provides many different workflows to interact with your data, including Python, R, and Julia Jupyter notebooks, a terminal, and even a Linux desktop view, all deployed for free and on demand in the cloud. The hub can be used to share code that reproduces key figures from papers, or to interactively explore any dandiset.

You can navigate all data conveniently using the Web app, or for more control we have several software tools for interacting with DANDI. The DANDI Python library and command line interface allows for performant upload, download, and querying. The DANDI REST API can also be used to interface with web applications. Also, all dandisets are made available using Datalad, which provides convenient data access across different systems.

Upload your data today! See our user guide for details. If you have any questions, please use our help desk and we will be happy to assist.