Introduction
In this paper, our contributions are:
- We build a new dataset, SEED-DV, recording 20 subjects' EEG data while they viewed 1,400 video clips covering 40 concepts.
- We annotate each video clip with meta information, forming the EEG-VP benchmark and a video reconstruction benchmark.
- We propose EEG2Video, a framework for decoding videos from EEG using Seq2Seq and DANA modules with an inflated Stable Diffusion model.
Motivation & Challenge
Previous research reconstructed videos from fMRI data.
However, fMRI has low temporal resolution (about 0.5 Hz), motivating us to
turn to neuroimaging techniques with high temporal resolution.
EEG typically has a sampling rate of 1000 Hz, but several challenges remain:
- No suitable EEG dataset.
- The spatial resolution of EEG (≈ 60 sensors) is far lower than that of fMRI (≈ 100,000 voxels); a rough data-volume comparison is sketched after this list.
- EEG has a low signal-to-noise ratio (SNR).
- The upper limit of the EEG's decoding capability remains unclear.
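For intuition, here is a back-of-the-envelope comparison of how much data each modality yields for a single short clip. The 2-second clip length and the 62-channel EEG cap are illustrative assumptions; the sampling rates and sensor/voxel counts follow the numbers above.

```python
# Rough data-volume comparison per short video clip (illustrative assumptions:
# 2-second clips, a 62-channel EEG cap; rates and counts follow the text above).
fmri_rate_hz, fmri_voxels = 0.5, 100_000   # ~1 volume every 2 s, ~100k voxels
eeg_rate_hz, eeg_channels = 1000, 62       # ~60 scalp sensors

clip_seconds = 2.0
fmri_values = int(fmri_rate_hz * clip_seconds) * fmri_voxels  # 1 volume -> 100,000 values
eeg_values = int(eeg_rate_hz * clip_seconds) * eeg_channels   # 2,000 time points x 62 channels

print(f"fMRI: {fmri_values:,} values per clip (high spatial, low temporal resolution)")
print(f"EEG : {eeg_values:,} values per clip (high temporal, low spatial resolution)")
```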
Video Stimuli Selection
We carefully select 40 concepts across 9 coarser classes to build our dataset:
Land Animal, Water Animal, Plant, Exercise, Human, Natural Scene, Food, Musical Instrument, Transportation.
- We choose natural videos rather than artificial ones (like anime).
- We try to cover as many natural scenes as possible.
- We would like roughly 1/3 of the classes to feature human beings, 1/3 animals and plants, and 1/3 non-living scenes or objects (a quick check of this grouping is sketched after this list).
- We would like roughly half of the videos to contain rapidly changing scenes, and the other half relatively static objects.
- We would like to balance the numbers of videos across the main colors.
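The short sketch below checks the intended 1/3 split against the 9 coarse classes; the grouping of classes into the three categories is our own reading of the criteria, not an official mapping.

```python
# Sanity check of the intended class balance. The grouping below is our reading
# of the selection criteria above, not an official mapping from the dataset.
coarse_classes = {
    "with humans":        ["Exercise", "Human"],
    "animals and plants": ["Land Animal", "Water Animal", "Plant"],
    "non-living":         ["Natural Scene", "Food", "Musical Instrument", "Transportation"],
}
for group, classes in coarse_classes.items():
    print(f"{group}: {len(classes)}/9 classes -> {classes}")
```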
Experiment Protocol
We recorded 20 subjects' EEG data while they were viewing video stimuli.
We collected 35 video clips for each concept.
- Subjects watched 7 video blocks in total, with a rest phase between every two blocks.
- Each block includes all 40 concepts; the order of the concepts is randomized across blocks.
- Subjects were first informed of the upcoming concept, then watched 5 video clips of that concept (a sketch of the resulting presentation order is shown below).
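The following minimal sketch illustrates how a presentation order satisfying this protocol could be generated. It is a hypothetical helper, not the actual stimulus-delivery script: 7 blocks, a freshly shuffled concept order per block, and 5 distinct clips per concept per block, giving 7 × 5 = 35 clips per concept and 1,400 clips in total.

```python
import random

# Hypothetical generator of the presentation order described above
# (not the authors' actual stimulus-delivery script).
N_BLOCKS, N_CONCEPTS, CLIPS_PER_CONCEPT = 7, 40, 5

def build_session(seed: int = 0):
    rng = random.Random(seed)
    session = []
    for block_id in range(N_BLOCKS):
        # each block presents all 40 concepts in a freshly shuffled order
        concept_order = rng.sample(range(N_CONCEPTS), k=N_CONCEPTS)
        block = []
        for concept in concept_order:
            # cue the upcoming concept, then play 5 of its clips back to back;
            # a different set of 5 clips is used in each block (35 per concept)
            clips = [f"concept{concept:02d}_clip{block_id * CLIPS_PER_CONCEPT + i:02d}"
                     for i in range(CLIPS_PER_CONCEPT)]
            block.append({"concept": concept, "clips": clips})
        session.append(block)
    return session

session = build_session()
assert sum(len(trial["clips"]) for block in session for trial in block) == 1400
```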
Meta Information Annotation: EEG-VP Benchmark
We manually annotated meta information for each clip to fully investigate the EEG's decoding capability (a sketch of a per-clip annotation record follows this list).
- Human: the appearance of humans: {Yes, No }.
- Face: the appearance of human faces: {Yes, No }.
- Number: the number of the main objects: {One, Two, Many }.
- Color: the color of the main objects: {Blue, Green, Red, Grey, White, Yellow, Colorful }.
- Optical Flow Score: the optical flow score of the video, measuring how fast the scene changes.
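A minimal sketch of how one clip's annotation record could be represented is shown below; the field names and example values are hypothetical and may differ from the released annotation files.

```python
from dataclasses import dataclass

# Hypothetical per-clip annotation record for the EEG-VP benchmark
# (field names and example values are illustrative, not the released format).
@dataclass
class ClipAnnotation:
    concept: str               # one of the 40 fine-grained concepts
    coarse_class: str          # one of the 9 coarser classes
    has_human: bool            # Human: {Yes, No}
    has_face: bool             # Face: {Yes, No}
    number: str                # Number of main objects: {"One", "Two", "Many"}
    color: str                 # Main color: {"Blue", "Green", "Red", "Grey", "White", "Yellow", "Colorful"}
    optical_flow_score: float  # how fast the scene changes

# Example values are made up for illustration.
example = ClipAnnotation(concept="some land animal", coarse_class="Land Animal",
                         has_human=False, has_face=False,
                         number="One", color="Green", optical_flow_score=1.8)
```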
EEG-VP Results
We evaluate a range of EEG models on the EEG-VP benchmark and draw the following findings (a minimal single-task evaluation sketch follows this list):
- We can decode Categories information from EEG signals.
- We can decode Color information from EEG signals.
- We can decode Dynamic information from EEG signals.
- We cannot reliably decode Number, Human, or Face information from EEG signals.
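As a concrete illustration of how a single EEG-VP task can be evaluated, the sketch below runs a simple logistic-regression baseline with cross-validation. Synthetic random features stand in for the recorded EEG so the snippet runs stand-alone; its accuracy is therefore only at chance, and the feature dimensions (62 channels × 5 frequency bands) are an assumption for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Illustrative evaluation of one EEG-VP task (e.g., Color). Random features
# replace the real EEG here, so this only demonstrates the procedure.
rng = np.random.default_rng(0)
n_clips, n_channels, n_bands = 1400, 62, 5              # assumed per-clip feature layout
X = rng.standard_normal((n_clips, n_channels * n_bands))  # flattened per-clip features
y = rng.integers(0, 7, size=n_clips)                      # 7 color classes as an example label

clf = LogisticRegression(max_iter=1000)
acc = cross_val_score(clf, X, y, cv=5).mean()
print(f"5-fold accuracy: {acc:.3f} (chance ≈ {1/7:.3f})")  # above-chance accuracy indicates decodability
```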
EEG2Video Framework
In this paper, we propose EEG2Video, a pipeline for reconstructing videos from EEG signals.
To deal with brain signals that have high temporal resolution but low spatial resolution,
we design several modules, guided by the results on the EEG-VP benchmark, to better decode videos.
- We use a Seq2Seq model to densely align EEG embeddings with low-level visual information.
- We use a semantic predictor to align EEG embeddings with semantic information.
- We design the DANA module to introduce fast/slow dynamic information into the diffusion process.
- We leverage inflated diffusion models to decode vivid videos (a schematic sketch of how these modules connect follows this list).
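The sketch below is a schematic of how these modules could fit together; the layer choices, feature dimensions, and DANA noise levels are illustrative assumptions rather than the released implementation.

```python
import torch
import torch.nn as nn

# Schematic of the EEG2Video data flow (module internals and shapes are
# illustrative assumptions, not the authors' released implementation).
class Seq2Seq(nn.Module):
    """Maps a sequence of EEG segments to a sequence of low-level video latents."""
    def __init__(self, eeg_dim=310, latent_dim=4 * 36 * 64, frames=6):
        super().__init__()
        self.encoder = nn.GRU(eeg_dim, 512, batch_first=True)
        self.decoder = nn.Linear(512, latent_dim)
        self.frames = frames

    def forward(self, eeg_seq):                   # (B, T_eeg, eeg_dim)
        h, _ = self.encoder(eeg_seq)
        return self.decoder(h[:, -self.frames:])  # (B, frames, latent_dim)

class SemanticPredictor(nn.Module):
    """Maps an EEG feature vector to a semantic (text-like) conditioning embedding."""
    def __init__(self, eeg_dim=310, sem_dim=768):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(eeg_dim, 1024), nn.ReLU(), nn.Linear(1024, sem_dim))

    def forward(self, eeg_feat):                  # (B, eeg_dim)
        return self.mlp(eeg_feat)                 # (B, sem_dim)

def dana_add_noise(latents, is_fast, t_fast=0.9, t_slow=0.5):
    """DANA-style noise adding: fast videos start from noisier latents than slow ones."""
    t = t_fast if is_fast else t_slow
    return (1 - t) * latents + t * torch.randn_like(latents)

# Putting the pieces together for one (random) EEG sample:
eeg_seq, eeg_feat = torch.randn(1, 7, 310), torch.randn(1, 310)
latents = Seq2Seq()(eeg_seq)                # low-level visual layout per frame
semantics = SemanticPredictor()(eeg_feat)   # high-level content conditioning
noisy = dana_add_noise(latents, is_fast=True)
# An inflated video diffusion model would then denoise `noisy`, conditioned on
# `semantics`, to generate the final video frames.
```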
More Samples
(Paired video panels: ground-truth Stimuli alongside our Decoding results.)
Failure Cases
(Paired video panels: ground-truth Stimuli alongside the corresponding failed Decoding results.)
We present some failure cases. These failures are typically caused by the model's inability
to correctly infer either the semantic information or the low-level visual information,
resulting in irrelevant generated videos.
Acknowledgments
This website is crafted by Xuan-Hao Liu and
Zheng Wang. Zheng Wang put tremendous effort into beautifying this website.
Thanks to all members of the BCMI Lab for their support and help.
Sincere thanks to all subjects who participated in our experiment!
Huge thanks to the Stable Diffusion team for open-sourcing their high-quality AIGC models.
Gratitude to the Tune-A-Video Team for their elegant text-to-video model.
And kudos to the Mind-Video Team for their pioneering and excellent fMRI-to-video work.