First, we overfit a single-layer RNN, factoring a short segment of classical music into two components.
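As a rough illustration of this step, here is a minimal sketch (not the project's actual code) of overfitting a single-layer RNN on one short segment via next-step prediction. It assumes the segment has already been encoded as a sequence of feature vectors; the feature dimension, hidden size, and training loop length are placeholder assumptions, and the two-component factorization is not shown.

```python
# Minimal sketch: overfit a single-layer RNN on one pre-encoded music segment.
# `segment`, `n_features`, and `hidden_size` are placeholder assumptions.
import torch
import torch.nn as nn

n_features = 16                              # assumed per-timestep feature dimension
hidden_size = 64                             # assumed hidden state size
segment = torch.randn(1, 200, n_features)    # stand-in for the real encoded segment

rnn = nn.RNN(n_features, hidden_size, num_layers=1, batch_first=True)
readout = nn.Linear(hidden_size, n_features)
opt = torch.optim.Adam(list(rnn.parameters()) + list(readout.parameters()), lr=1e-3)

# Deliberately overfit: train next-step prediction on the single segment
# until it is essentially memorized.
inputs, targets = segment[:, :-1], segment[:, 1:]
for step in range(2000):
    opt.zero_grad()
    hidden_states, _ = rnn(inputs)
    loss = nn.functional.mse_loss(readout(hidden_states), targets)
    loss.backward()
    opt.step()
```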
Then we map MediaPipe hand-tracking landmarks into the RNN's input space so that the learned network can be "played" via hand-movement.
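A hedged sketch of what that mapping could look like, assuming a simple fixed linear projection from the 21 MediaPipe hand landmarks (63 coordinates) down to the RNN's input dimension; the actual mapping used is not specified here, and `n_features` and `projection` are assumptions that must match the trained network above.

```python
# Minimal sketch: turn one webcam frame into a single RNN input vector.
# The fixed random `projection` is an assumed stand-in for the real mapping.
import cv2
import mediapipe as mp
import numpy as np
import torch

n_features = 16                                        # must match the RNN's input size
hands = mp.solutions.hands.Hands(max_num_hands=1)
projection = torch.randn(63, n_features) / 63 ** 0.5   # assumed fixed linear map

def frame_to_rnn_input(frame_bgr):
    """Return an (n_features,) tensor for one frame, or None if no hand is found."""
    result = hands.process(cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2RGB))
    if not result.multi_hand_landmarks:
        return None
    lm = result.multi_hand_landmarks[0].landmark
    coords = np.array([[p.x, p.y, p.z] for p in lm], dtype=np.float32).reshape(-1)
    return torch.from_numpy(coords) @ projection
```

Feeding one such vector per frame into the overfitted RNN lets hand motion steer the hidden state, which is what makes the network "playable".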
Training details can be explored further here.