VideoPanda introduces a novel approach for synthesizing 360° videos conditioned on text prompts or single-view video. It augments a video diffusion model with multi-view attention layers, enabling the generation of consistent multi-view videos that can be combined into immersive panoramic content. The model is jointly trained on text-only and single-view video conditions and supports autoregressive generation of long videos. Extensive evaluations demonstrate that VideoPanda generates more realistic and coherent 360° panoramic videos than existing methods. A minimal sketch of the multi-view attention idea follows.
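
The sketch below illustrates one plausible form of a multi-view attention layer: per-frame features from all views are flattened into a single sequence so attention runs jointly across views, encouraging cross-view consistency. The module name, tensor layout, and hyperparameters are illustrative assumptions, not VideoPanda's actual implementation.

```python
import torch
import torch.nn as nn


class MultiViewAttention(nn.Module):
    """Self-attention across the view axis of per-frame features.

    Hypothetical sketch: expects features shaped (batch, views, tokens, dim);
    tokens from all views of the same frame attend to one another.
    """

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, v, t, d = x.shape
        # Merge views and tokens into one sequence so attention spans views.
        seq = x.reshape(b, v * t, d)
        h = self.norm(seq)
        out, _ = self.attn(h, h, h, need_weights=False)
        # Residual connection keeps the pretrained single-view pathway intact
        # when the new layer is initialized near zero.
        return (seq + out).reshape(b, v, t, d)


# Example: 4 views, 16 tokens per view, 64-dim features.
if __name__ == "__main__":
    layer = MultiViewAttention(dim=64)
    feats = torch.randn(2, 4, 16, 64)
    print(layer(feats).shape)  # torch.Size([2, 4, 16, 64])
```

Inserting such layers with a residual connection is a common way to augment a pretrained video diffusion backbone, since the base model's single-view behavior is preserved at initialization while cross-view coupling is learned during fine-tuning.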