Kun Su*, Xiulong Liu*, Eli Shlizerman (* equal contribution)
This repository contains the code and the project page for the work "How Does it Sound? Generation of Rhythmic Soundtracks for Human Movement Videos" (paper), presented at NeurIPS 2021.
Please Turn Sound ON
skiing_demo.mov
One of the primary purposes of video is to capture people and their unique activities. It is often the case that the experience of watching the video can be enhanced by adding a musical soundtrack that is in sync with the rhythmic features of these activities. How would this soundtrack sound? Such a problem is challenging since little is known about capturing the rhythmic nature of free body movements. In this work, we explore this problem and propose a novel system, called 'RhythmicNet', which takes as input a video with human movements and generates a soundtrack for it. RhythmicNet works directly with human movements by extracting skeleton keypoints and implementing a sequence of models translating them to rhythmic sounds. RhythmicNet follows the natural process of music improvisation, which includes the prescription of streams of the beat, the rhythm and the melody. In particular, RhythmicNet first infers the music beat and the style pattern from per-frame body keypoints to produce the rhythm. Next, it implements a transformer-based model to generate the hits of drum instruments and a U-Net-based model to generate the velocities and offsets of the instruments. Additional types of instruments are added to the soundtrack by further conditioning on the generated drum sounds. We evaluate RhythmicNet on large-scale video datasets that include body movements with inherent sound association, such as dance, as well as 'in-the-wild' internet videos of various movements and actions. We show that the method can generate plausible music that aligns with different types of human movements.
In the 'How Does it Sound?' project, we address the challenge of whether an AI system can generate rhythmic music corresponding to videos with human movements. We built a full pipeline, named 'RhythmicNet', that takes frames of body keypoints as input and generates rhythmic music for the video. Note that our system generalizes to 'in-the-wild' videos. Here is an example of generated samples for a volleyball video:
We generate a variety of full soundtracks for different 'in-the-wild' human activity videos. Below we provide a few short samples.
conductor.mp4
mime.mp4
mj.mov
freerun.mov
volleyball.mov
tiktokgestdance.mp4
soccertricks.mp4
jumpsinglehop.mp4
RhythmicNet can also generate drum music as an accompaniment to a selected piece of background music.
manskate_shallow.mp4
lift_we_are_the_champion.mp4
RhythmicNet contains three major components:
- Video2Rhythm: beat prediction and style extraction to compose the music rhythm
- Rhythm2Drum: drum music generation based on the music rhythm
- Drum2Music: full music generation (adding piano or guitar) based on the drum track
The components are shown in the figure above. We describe each component separately below.
We decompose the rhythm into two streams: beats and style. We propose a novel model to predict music beats and a kinematic-offsets-based approach to extract style patterns from human movements. The beat is a binary periodic signal determined by a fixed tempo. It is obtained by the music beat prediction network, a novel graph-based attention model that captures the regular pattern of beats from human body movements and is trained in a supervised way by pairing body keypoints with ground-truth music beats. In addition to the beats, the rhythm also contains irregular patterns, which we call the style. To extract the style, we represent the changing motions as kinematic offsets and apply spectral analysis to them to select frames with fast-changing motions; these frames form the style pattern. Combining the beats and the style yields the rhythm.
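As a rough illustration of the style stream, the sketch below computes kinematic offsets from 2D keypoints, keeps the fastest-changing frames as the style pattern, and merges it with a beat signal. The function names, the quantile threshold, and the simple onset-strength measure are illustrative assumptions, not the exact spectral analysis used in the paper.

```python
import numpy as np

def style_pattern(keypoints, quantile=0.9):
    """Sketch: flag frames with fast-changing motion from 2D keypoints.

    keypoints: (T, J, 2) array of body keypoints per frame.
    Returns a (T,) binary "style" signal.
    """
    # Kinematic offsets: per-frame displacement of every joint.
    offsets = np.diff(keypoints, axis=0)                    # (T-1, J, 2)
    motion = np.linalg.norm(offsets, axis=-1).sum(axis=1)   # (T-1,) overall motion

    # A simple onset-strength proxy: positive change in motion energy.
    flux = np.maximum(np.diff(motion, prepend=motion[0]), 0.0)

    # Keep only the fastest-changing frames as the style pattern.
    style = (flux > np.quantile(flux, quantile)).astype(np.float32)
    return np.concatenate([[0.0], style])                   # pad back to length T

def combine_rhythm(beats, style):
    """Union of the periodic beat signal and the sparse style signal."""
    return np.clip(beats + style, 0.0, 1.0)

# Toy example: 10 seconds at 30 fps, 17 COCO-style joints, 120 BPM beats.
kps = np.cumsum(np.random.randn(300, 17, 2) * 0.5, axis=0)
beats = np.zeros(300)
beats[::15] = 1.0                                           # one beat every 0.5 s
rhythm = combine_rhythm(beats, style_pattern(kps))
```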
Rhythm2Drum interprets the rhythm from the previous stage as drum sounds. Each drum track is represented by three matrices: hits, velocities, and offsets. The hits represent the presence of drum onsets, the velocities represent the loudness of the hits, and the offsets represent the time shift of each hit. Rhythm2Drum operates in two stages. In the first stage, an encoder-decoder Transformer generates the drum hits (onsets) given the rhythm; by autoregressively learning the hits as a word sequence conditioned on the rhythm, the Transformer produces natural and diverse drum onsets. In the second stage, a U-Net-type model generates the drum velocities and offsets, conditioned on the hits matrix.
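The three-matrix drum representation can be pictured as follows. This is a hedged sketch: the nine-drum vocabulary, the 16th-note grid, and the event format are assumptions chosen for illustration and may differ in detail from the paper's encoding.

```python
import numpy as np

NUM_DRUMS = 9       # e.g., kick, snare, hi-hats, toms, cymbals (assumed vocabulary)
STEPS_PER_BAR = 16  # 16th-note grid (assumed resolution)

def events_to_matrices(events, num_steps):
    """Turn (step, drum, velocity, offset) events into the three drum matrices.

    hits:       (num_steps, NUM_DRUMS), binary    -- is this drum struck at this step?
    velocities: (num_steps, NUM_DRUMS), in [0, 1] -- how loud is each hit?
    offsets:    (num_steps, NUM_DRUMS), small +/- -- time shift of each hit off the grid.
    """
    hits = np.zeros((num_steps, NUM_DRUMS), dtype=np.float32)
    velocities = np.zeros_like(hits)
    offsets = np.zeros_like(hits)
    for step, drum, vel, off in events:
        hits[step, drum] = 1.0
        velocities[step, drum] = vel
        offsets[step, drum] = off
    return hits, velocities, offsets

# One bar of a basic pattern: kick on beats 1 and 3, slightly "laid back" snare on 2 and 4.
events = [(0, 0, 0.9, 0.0), (4, 1, 0.8, 0.05),
          (8, 0, 0.9, 0.0), (12, 1, 0.85, 0.06)]
hits, velocities, offsets = events_to_matrices(events, num_steps=STEPS_PER_BAR)
```

In the actual pipeline, the Transformer predicts the hits matrix from the rhythm, and the U-Net then fills in the velocities and offsets conditioned on those hits.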
In this last stage we add further instruments to enrich the soundtrack. Since the drum track already carries the rhythm, we condition the additional instrument stream on the generated drum track. We consider piano or guitar as the additional instrument, since these are dominant instruments. Drum2Music completes the drum music by adopting an encoder-decoder architecture based on Transformer-XL to generate a piano or guitar track conditioned on the generated drum performance.
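To make the conditioning idea concrete, here is a minimal PyTorch sketch in which an encoder reads a sequence of drum tokens and a decoder autoregressively generates piano or guitar tokens conditioned on them. It uses the standard nn.Transformer as a stand-in for the paper's Transformer-XL based architecture, and all vocabulary sizes, layer counts, and token formats are placeholder assumptions.

```python
import torch
import torch.nn as nn

class DrumConditionedMusicModel(nn.Module):
    """Sketch of Drum2Music-style conditioning with a plain encoder-decoder
    Transformer (the paper uses a Transformer-XL variant; dimensions and
    vocabularies here are placeholders)."""

    def __init__(self, drum_vocab=256, music_vocab=512, d_model=256):
        super().__init__()
        self.drum_emb = nn.Embedding(drum_vocab, d_model)
        self.music_emb = nn.Embedding(music_vocab, d_model)
        self.transformer = nn.Transformer(d_model=d_model, nhead=4,
                                          num_encoder_layers=3,
                                          num_decoder_layers=3,
                                          batch_first=True)
        self.out = nn.Linear(d_model, music_vocab)

    def forward(self, drum_tokens, music_tokens):
        # Encoder reads the generated drum track; decoder predicts the next
        # piano/guitar token at each step, conditioned on the drums.
        memory_in = self.drum_emb(drum_tokens)
        target_in = self.music_emb(music_tokens)
        causal_mask = self.transformer.generate_square_subsequent_mask(
            music_tokens.size(1))
        hidden = self.transformer(memory_in, target_in, tgt_mask=causal_mask)
        return self.out(hidden)

# Toy usage: batch of 2 sequences, 64 drum tokens conditioning 32 music tokens.
model = DrumConditionedMusicModel()
logits = model(torch.randint(0, 256, (2, 64)), torch.randint(0, 512, (2, 32)))
```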
For more information, contact the NeuroAI Lab at the University of Washington: Eli Shlizerman (PI), [email protected]