Research At Atom

Rebuild the dataset

Follow the info section below.


GitHub Repo

The DatologyAI team’s BeyondWeb paper (https://arxiv.org/pdf/2508.10975) showed that synthetic data can make language models up to 6× more efficient than those trained only on raw web text.

I’m taking this further: instead of just training on synthetic corpora, I’m starting at the tokenizer stage. By rephrasing raw text with an LLM, building a synthetic dataset, and then training a 150–200M model on top, the goal is to reach the power of models 4–5× larger with far less compute.
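To make the pipeline concrete, here is a minimal sketch of the rephrasing step, assuming the Hugging Face transformers library and a small instruction-tuned model. The model name, prompt, and file paths are placeholders for illustration, not the project's actual choices.

```python
from transformers import pipeline

# Assumed rephrasing model; any small instruction-tuned LLM would do here.
rephraser = pipeline("text-generation", model="Qwen/Qwen2.5-1.5B-Instruct")

PROMPT = (
    "Rewrite the following web text as clear, self-contained prose, "
    "keeping every fact and dropping boilerplate:\n\n{doc}\n\nRewritten:"
)

def rephrase(doc: str) -> str:
    """Return one synthetic rephrasing of a raw web document."""
    out = rephraser(
        PROMPT.format(doc=doc),
        max_new_tokens=512,
        do_sample=True,
        temperature=0.7,
        return_full_text=False,  # keep only the generated rewrite
    )
    return out[0]["generated_text"].strip()

# Build the synthetic corpus document by document (paths are placeholders).
with open("raw_web.txt") as src, open("synthetic.txt", "w") as dst:
    for line in src:
        if line.strip():
            dst.write(rephrase(line.strip()) + "\n")
```

The synthetic file produced this way is what the tokenizer and the model below are trained on.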

Dumb way to build AI

**This is an introduction to what I have built. I am still working on the LLM; once it is complete, the GitHub links will be posted here.**

Description

Tokenizer (32k)

A model’s tokenizer decides how raw text is broken into units. Most modern LLMs use vocabularies in the 30–50k range, with 32k being a sweet spot: large enough to capture diverse words and phrases, but small enough to stay efficient.

By training the tokenizer on synthetic rephrasings instead of messy raw web data, the vocabulary can be tuned to capture cleaner, denser representations of language. This means the model wastes fewer tokens on noise and redundancy.
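As a rough illustration, the 32k vocabulary can be trained on the synthetic corpus with the Hugging Face tokenizers library. The file names and special tokens below are assumptions, not the project's final settings.

```python
from tokenizers import Tokenizer, models, pre_tokenizers, trainers

# Byte-level BPE, the scheme used by most modern LLM tokenizers.
tokenizer = Tokenizer(models.BPE(unk_token="<unk>"))
tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel(add_prefix_space=False)

trainer = trainers.BpeTrainer(
    vocab_size=32_000,  # the 32k sweet spot described above
    special_tokens=["<unk>", "<s>", "</s>", "<pad>"],
)

# Train on the synthetic rephrasings rather than raw web dumps.
tokenizer.train(files=["synthetic.txt"], trainer=trainer)
tokenizer.save("tokenizer-32k.json")

# Quick check: cleaner text should encode into fewer, denser tokens.
print(tokenizer.encode("Synthetic data helps small models punch above their weight.").tokens)
```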

The result: when a 150–200M model is trained with this tokenizer and synthetic data, it can punch far above its weight, achieving the efficiency of models 4–5× larger while demanding only a fraction of the compute.

The 150M+ Model (with MoE)

On top of the tokenizer, I’m building a 150–200M parameter model, designed as a Mixture of Experts (MoE). Instead of scaling to billions of parameters, MoE lets different “experts” specialize, while only a subset activates for each input.

This means the model stays lightweight in compute but still gains the expressive power of something much larger. Combined with a synthetic-trained tokenizer, the aim is for this 150M-class model to deliver results closer to a 600M–1B model, but at a fraction of the cost.
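For readers unfamiliar with MoE, here is a minimal top-k routed feed-forward layer in PyTorch. The sizes, number of experts, and routing scheme are illustrative only and do not reflect the final architecture of this model.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MoEFeedForward(nn.Module):
    """Feed-forward block where each token is processed by only top_k of n_experts."""

    def __init__(self, d_model=512, d_ff=2048, n_experts=8, top_k=2):
        super().__init__()
        self.top_k = top_k
        self.gate = nn.Linear(d_model, n_experts)  # router: scores each expert per token
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        )

    def forward(self, x):                        # x: (batch, seq, d_model)
        scores = self.gate(x)                    # (batch, seq, n_experts)
        weights, idx = scores.topk(self.top_k, dim=-1)
        weights = F.softmax(weights, dim=-1)     # normalise over the chosen experts
        out = torch.zeros_like(x)
        # Only the selected experts run for each token, which keeps compute low.
        for k in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = idx[..., k] == e          # tokens routed to expert e in slot k
                if mask.any():
                    out[mask] += weights[..., k][mask].unsqueeze(-1) * expert(x[mask])
        return out

layer = MoEFeedForward()
tokens = torch.randn(2, 16, 512)
print(layer(tokens).shape)                       # torch.Size([2, 16, 512])
```

With 8 experts and top-2 routing, each token only pays the cost of two feed-forward blocks while the layer as a whole holds roughly four times more parameters, which is the trade-off the paragraph above describes.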