It started with a bet. Someone claimed that existing AI music models couldn't produce anything that sounded genuinely Afrobeats: not just "African-sounding" but actually right. Someone else bet they could fine-tune one that could. It was a Friday afternoon. We should have gone home.
Three months later, we have a system called Gbẹdu that generates original Afrobeats tracks with code-switched Yoruba/English lyrics, a synthesized vocal persona, and a polyrhythmic drum arrangement that (we're slightly embarrassed to admit how proud this makes us) does not sound like corporate hold music with drums. The bet cost us a quarter's worth of GPU time and an amount of collective emotional investment that we have agreed not to quantify.
This is the honest account of how we got here. Including the parts that didn't work.
Why Western Music Models Can't Do Afrobeats
Every major audio generation model trained before 2025 (MusicGen, AudioCraft, Stable Audio, the original ACE-Step) has one thing in common: the training data is overwhelmingly Western. LAION-Audio, the dataset that underlies most of these systems, is roughly 630,000 audio-text pairs sourced primarily from Freesound, YouTube-AudioSet, and licensed Western catalogues. The African music in there amounts to statistical noise.
That's a representation problem, but it's not the only problem. Afrobeats operates in a fundamentally different rhythmic space than Western pop, and "add more African songs" doesn't fix a model that has internalized the wrong rhythmic grammar at a deep level.
Specifically:
- 12/8 feel. Much of Afrobeats runs on compound time, a triplet subdivision underneath what looks like a 4/4 groove. Western models trained on 4/4 grids generate music that is technically in time but feels metronomically wrong. Like a metronome performing a heartbeat. Correct tempo, wrong organism.
- Polyrhythmic layering. A standard Afrobeats drum arrangement puts the kick, hi-hat, and talking drum in a three-way conversation where none of them is fully in charge. Western models collapse this into one undifferentiated drum texture. The result sounds like someone described Afrobeats to a model that had only ever listened to The Chainsmokers.
- Call-and-response structure. The lead vocal in Afrobeats answers things: percussion phrases, horn lines, backing harmonies. Models trained on Western pop, where the melody sits on top of everything like a decorative hat, generate vocals that address nothing and no one.
- Silence as information. Great Afrobeats producers use silence with intention. The space between notes is structural, not empty. A model that has never learned this treats silence as a bug to be filled.
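To make the 12/8 point concrete, here is a minimal sketch comparing the slots a straight sixteenth-note grid offers with the slots a triplet feel needs. The names `beat_grid` and `gap_to_nearest` are ours for illustration only, and 110 BPM is just an example tempo.

```python
from fractions import Fraction

def beat_grid(beats_per_bar: int, subdivisions_per_beat: int) -> list[Fraction]:
    """Onset slots within one bar, expressed as exact fractions of the bar."""
    total = beats_per_bar * subdivisions_per_beat
    return [Fraction(i, total) for i in range(total)]

def gap_to_nearest(position: Fraction, grid: list[Fraction]) -> Fraction:
    """Distance from a position to the closest grid slot, wrapping at the bar line."""
    return min(min(abs(position - g), 1 - abs(position - g)) for g in grid)

straight = beat_grid(4, 4)  # 4/4 with sixteenth notes: 16 slots per bar
compound = beat_grid(4, 3)  # 12/8 feel: 4 beats, each split in three: 12 slots

# The two grids only agree on the four main beats.
shared = sorted(set(straight) & set(compound))

# How far a triplet slot can sit from the nearest sixteenth slot.
worst_gap = max(gap_to_nearest(p, straight) for p in compound)

# Convert that gap to milliseconds at an example tempo of 110 BPM.
bar_seconds = 4 * 60 / 110
offset_ms = float(worst_gap) * bar_seconds * 1000

print(f"shared slots: {len(shared)} of {len(compound)}")
print(f"worst gap: {worst_gap} of a bar, about {offset_ms:.0f} ms at 110 BPM")
```

Eight of the twelve triplet slots have no home on the sixteenth grid. A model that snaps everything to that grid stays in tempo but pulls each of those onsets off its place, which is one plausible reading of "correct tempo, wrong organism."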
The Training Data Problem
We needed licensed Afrobeats. Not scraped, not "similar enough," not public domain folk recordings that happen to be from West Africa. Contemporary Afrobeats, with proper provenance and artists who understood what they were signing.
We reached out to 14 labels and independent artist collectives across Lagos, Nairobi, Accra, and Kampala. Eight responded. Three said yes immediately, one of them within four hours, which we took as a sign that the idea wasn't completely unhinged. One label ghosted us for three weeks, saw an early demo, and came back. Two said no.
The two rejections were interesting. One was straightforward: they didn't want to participate in AI music generation on principle, and they explained their reasoning clearly. We have no argument with that. The other was more specific: they were worried about what this meant for producers. What happens to the session drummer when the session is a prompt? It's a legitimate concern and one we still think about. We don't have a clean answer.
The artists who did participate signed a three-page licensing agreement with explicit terms: training use only, no commercial release of outputs that would compete with their catalogues, revenue sharing provisions if Gbẹdu ever generates commercial revenue, and a clause giving them right of refusal on specific output use cases we haven't imagined yet. We paid a flat licensing fee per track. This was not a "we'll credit you on the website" arrangement.
The final dataset: somewhere north of 40,000 tracks spanning releases from 1998 to 2025. The metadata situation was, diplomatically, a project in itself: most independent releases come with artist name, title, and year, and nothing else. No BPM, no key, no instrumentation labels, no structural markers. We built an annotation pipeline to address this and spent more time on metadata than we care to admit.
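For a sense of what the metadata gap looks like per track, here is a hypothetical annotation record. The class name, field names, and types are assumptions made for illustration; this is not our actual schema.

```python
from dataclasses import dataclass, fields
from typing import Optional

@dataclass
class TrackAnnotation:
    # What most independent releases arrive with.
    artist: str
    title: str
    year: int
    # What an annotation pipeline has to fill in (hypothetical field names).
    bpm: Optional[float] = None
    key: Optional[str] = None
    instrumentation: Optional[list[str]] = None
    # (section label, start time in seconds), e.g. ("hook", 42.0)
    structure: Optional[list[tuple[str, float]]] = None

def missing_fields(track: TrackAnnotation) -> list[str]:
    """Names of the fields that still need annotating, in declaration order."""
    return [f.name for f in fields(track) if getattr(track, f.name) is None]

# Placeholder values, not a real release.
raw = TrackAnnotation(artist="Example Artist", title="Example Title", year=2012)
print(missing_fields(raw))
```

A record like `raw` reports all four annotation fields as missing, which is the starting state described above for most independent releases.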
The Fine-Tuning: Twenty-Two Days and One Discarded Run
We chose ACE-Step 1.5 as our base model. Its audio codec reconstructs high frequencies more cleanly than MusicGen's EnCodec, which matters for Afrobeats, where the hi-hat and shekere textures live in the upper frequency range, and its conditioning architecture accepts custom metadata embeddings without major surgery.
Four A100s. Three weeks of training, technically. Four weeks if you count the run we killed at day nine because the model was rhythmically hedging: it had learned to generate music that was spectrally correct but metrically uncommitted, sitting ambiguously between 4/4 and 12/8 rather than committing to either. The fix was a custom rhythmic consistency loss term. The restart cost us a week and a small amount of collective morale.
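One way to picture a penalty on metrical hedging is the toy sketch below. It is an illustration of the idea, not the loss term we trained with: it takes onset phases within a beat as plain floats, scores how well they fit a straight grid and a triplet grid, and penalizes outputs that fit both about equally. The function names are hypothetical, and a real training loss would work on differentiable onset estimates inside the model's tensor framework.

```python
import math

def grid_fit(phases: list[float], subdivisions: int) -> float:
    """Mean alignment of onset phases (0..1 within a beat) to an even grid.

    Returns 1.0 when every onset sits exactly on a grid slot, and values
    near 0 when the onsets are spread evenly against the grid.
    """
    if not phases:
        raise ValueError("need at least one onset phase")
    return sum(math.cos(2 * math.pi * subdivisions * p) for p in phases) / len(phases)

def rhythmic_ambiguity_penalty(phases: list[float]) -> float:
    """High when the onsets fit 4-way and 3-way subdivision about equally."""
    straight_fit = grid_fit(phases, 4)  # sixteenth-note feel
    triplet_fit = grid_fit(phases, 3)   # 12/8 feel
    return max(0.0, 1.0 - abs(straight_fit - triplet_fit))

straight_onsets = [0.0, 0.25, 0.5, 0.75]
triplet_onsets = [0.0, 1 / 3, 2 / 3]
hedging_onsets = straight_onsets + triplet_onsets  # sitting between the two feels

for name, onsets in [("straight", straight_onsets),
                     ("triplet", triplet_onsets),
                     ("hedging", hedging_onsets)]:
    print(name, round(rhythmic_ambiguity_penalty(onsets), 3))
```

A groove that commits to either feel scores close to zero, while the mixed pattern scores close to one. The sketch deliberately does not care which feel wins, only that one does.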
The loss curves looked fine by epoch 12. The outputs were still wrong. Technically competent, soulless. The kind of Afrobeats that would play in a stock footage library under a video titled "African Business Meeting: Diverse Team."
Around epoch 20, something clicked. We genuinely do not know exactly what changed. The groove appeared. The percussion started having opinions. We played it for someone who wasn't part of the project and they moved their head slightly, involuntarily, which we counted as success.
Language: Code-Switching at 110 BPM
Afrobeats lyrics do not stay in one language. A typical verse opens in Yoruba, drops into Lagos Pidgin for the hook, throws in some Swahili slang in the bridge, and lands the punchline in English. This isn't stylistic flourish: it's the expressive texture of the genre. The language shift means something. When the verse goes Yoruba-to-Pidgin, you're moving from formality to intimacy. When it lands in English, you're usually either flexing or being sincere.
Standard TTS systems are completely useless here. They assume monolingual input and produce output that sounds like a very confident GPS navigation system attempting to hype a crowd.
We fine-tuned Llama-3 on a lyric corpus extracted from our transcribed dataset, approximately 350,000 lines with natural code-switching preserved. Early versions were bad in ways that a native Yoruba speaker would find actively offensive. The tonal language rendered as phoneme sequences was wrong at the semantic level, not just the acoustic level. Version 4 is better. We would not call it fluent. We would call it good enough that the wrongness reads as stylistic choice rather than incompetence, which is a humbler bar than we started with.
The insight that helped most: giving the model an explicit language mix target. When we instructed it to generate with a 40/40/20 split of Yoruba/English/Pidgin, it stopped defaulting to English for the "important" lines (the hook, the outro) and treating Yoruba as decoration. The resulting lyrics read like actual Afrobeats lyrics rather than English pop with Yoruba sprinkled for atmosphere.
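A minimal sketch of the language mix target, both as an instruction and as a check on the output. The prompt wording, the function names, and the choice to measure the mix by line count are assumptions for illustration; per-line language tags are taken as given here.

```python
from collections import Counter

# The 40/40/20 split described above.
TARGET_MIX = {"yoruba": 0.40, "english": 0.40, "pidgin": 0.20}

def build_instruction(target: dict[str, float]) -> str:
    """Turn a target mix into a plain-language instruction (hypothetical wording)."""
    parts = ", ".join(
        f"{round(share * 100)}% {lang.capitalize()}" for lang, share in target.items()
    )
    return f"Write the lyrics with this language mix by line count: {parts}."

def measured_mix(line_languages: list[str]) -> dict[str, float]:
    """Share of lines per language, given one language tag per lyric line."""
    counts = Counter(line_languages)
    return {lang: n / len(line_languages) for lang, n in counts.items()}

def max_deviation(actual: dict[str, float], target: dict[str, float]) -> float:
    """Largest absolute gap between actual and target share for any language."""
    langs = set(actual) | set(target)
    return max(abs(actual.get(l, 0.0) - target.get(l, 0.0)) for l in langs)

print(build_instruction(TARGET_MIX))

# An English-heavy draft, the failure mode described above.
draft = ["english"] * 8 + ["yoruba"] * 2
print(round(max_deviation(measured_mix(draft), TARGET_MIX), 2))
```

Counting lines is a blunt measure, since a single line can switch languages mid-phrase, but it is enough to flag a draft that has drifted back to English for the hook and the outro.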
Voice: What We Built and What We Didn't
We are not cloning anyone's voice. This matters enough to say directly, before the description of how we built the vocal component.
We use RVC v2 (Retrieval-based Voice Conversion v2). The approach: instead of synthesizing a voice from scratch, RVC transfers the timbre and phrasing characteristics of a source singer onto an input vocal performance. We worked with a professional session singer in Nairobi, contracted specifically for this, with a scope-of-use agreement that limits application to Gbẹdu demos, includes revenue sharing provisions, and gives him right of refusal on specific output cases. We record reference vocals in sessions with him, then use RVC to build a consistent artificial vocal persona from those sessions.
The persona has a name internally. We're not publishing it. It doesn't correspond to any existing public figure. This is how AI voice should be handled and we are fully aware that most of the industry disagrees with us by example if not by argument.
What Works, and the One Thing That Still Doesn't
The groove. The energy. The production aesthetic: the compression, the reverb character, the way the 808 sits in the low end. Feed Gbẹdu "mid-tempo Afropop, Yoruba, nostalgic mood, talking drum present" and the result sounds like it belongs on a 2 AM playlist. We mean that as the highest possible compliment. 2 AM playlists are curated by people who know what they want.
The polyrhythmic layering is noticeably better than any Western model we've tested. The code-switched lyrics are authentic enough that two of our evaluators asked if we'd sampled real songs. This felt good. We did not share it with the investors.
What still doesn't work is melismatic vocals. We are working on this. We are not confident we will solve it without a fundamentally different approach to how the vocal content is generated before RVC gets to it. The talking drum semantics are a separate, possibly harder problem: we can generate textures that sound correct but cannot yet generate talking drum parts that actually say something in the tonal language sense. These are the problems where the gap between "almost" and "actually" is doing a lot of work.
The Use Case We Didn't Design For
A producer in Lagos has been using Gbẹdu for demo tracks. Not finished productions. Not anything he's releasing. Rough sketches, to show clients a direction before he commits studio time. He said it saves him roughly three hours per pitch.
We didn't design for that. We were thinking about the end product, the finished track. He's using it as a communication tool, a fast mockup generator to get himself and a client into the same room with a shared reference point before the real work starts. "It's like having a session musician who can only play demo takes but plays them instantly." That framing has, somewhat embarrassingly, completely reorganized how we think about the product roadmap.
Three other producers in Nairobi and one in Accra are in the same private beta. The feedback is consistent: not good enough to release, good enough to pitch. This is exactly the right bar for what it currently is.
We're not trying to replace Afrobeats artists. That framing is both wrong and, frankly, offensive to the craft. The question we're more interested in is whether there's a tool that compresses the pre-production phase (the ideation, the rough sketching, the client alignment) in a way that gives producers more time for the part only they can do.
Whether the answer to that question is yes at scale is something we'll know more about in the next six months.
What's Next
The waitlist for Gbẹdu beta is open at datacraft.co.ke/labs/gbedu. We're prioritising producers and songwriters who will give us honest feedback over people who will be polite about it. If you think it's bad, we need to know specifically how it's bad.
On the technical side: hierarchical structure planning (the model drifts after about 90 seconds of audio, which is a problem), an expanded Swahili corpus for Gengetone-specific outputs, and a first attempt at a dedicated melismatic vocal module that we are genuinely not sure will work. We are committing to attempting it, not to succeeding.
Our investors have not been formally briefed on this project.
Gbẹdu started as a Friday afternoon bet. It's now three months of work, four A100s, tens of thousands of licensed tracks, and a lot of evenings listening to outputs that were almost right but not quite. The almost is what keeps you going. The gap between almost-Afrobeats and actual Afrobeats is where all the interesting technical problems are, and if we're honest, it's also where all the interesting questions about what AI music generation is actually for are hiding too.
We'll be back with more when we have more to show. Until then: join the waitlist, and if you're a producer who thinks we're thinking about this wrong, please tell us. We are extremely open to being told we're thinking about this wrong.