Dumb way to build AI

This blog is a brief introduction to my extensive research into optimisers and ways to reduce the size of LLMs while keeping their outputs unhindered.

Aditya Prasad Panigrahi

8/15/2025 · 1 min read

Description

Quick summary of what I'm doing.

Efficiency

3× more efficient results, so a 200M-parameter model will output results equivalent to a 650–700M model.

INTRO

The DatalogyAI team’s BeyondWeb paper (https://arxiv.org/pdf/2508.10975) showed that synthetic data can make language models up to 6× more efficient than those trained only on raw web text.

I’m taking this further: instead of just training on synthetic corpora, I’m starting at the tokenizer stage. By rephrasing raw text with an LLM, building a synthetic dataset, and then training a 150–200M model on top, the goal is to reach the power of models 4–5× larger with far less compute.
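To make the "start at the tokenizer stage" idea concrete, here is a minimal sketch of training a fresh byte-level BPE tokenizer on the rephrased corpus with the Hugging Face tokenizers library. The file name synthetic_corpus.txt and the 32k vocabulary size are illustrative assumptions, not fixed choices from this project.

```python
# Minimal sketch: train a byte-level BPE tokenizer on the synthetic (rephrased) corpus.
# Assumptions: the rephrased text lives in synthetic_corpus.txt and a 32k vocab is used.
from tokenizers import Tokenizer, models, pre_tokenizers, trainers

tokenizer = Tokenizer(models.BPE(unk_token="[UNK]"))
tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel(add_prefix_space=False)

trainer = trainers.BpeTrainer(
    vocab_size=32_000,  # assumed size; tune for the 150-200M target model
    special_tokens=["[UNK]", "[PAD]", "[BOS]", "[EOS]"],
)

# Train on the rephrased corpus so the vocabulary reflects the cleaner synthetic
# text rather than the noisier raw web dump.
tokenizer.train(files=["synthetic_corpus.txt"], trainer=trainer)
tokenizer.save("synthetic_tokenizer.json")
```

The saved tokenizer.json can then be loaded when pretraining the 150–200M model, so the vocabulary and the training data come from the same synthetic distribution.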

The dataset:

Pipeline:

  • Core dataset: open-web text

  • Feeding the raw dataset into Llama-3 to rephrase it

  • Creating a larger dataset from these rephrased Llama-3 outputs

This produces a synthetic dataset for pretraining; a minimal sketch of the rephrasing step follows below.
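Here is a rough sketch of the rephrasing step, assuming a Hugging Face transformers text-generation pipeline with an instruction-tuned Llama-3 checkpoint. The model name, prompt wording, and generation settings are assumptions for illustration, not the exact configuration used in this project.

```python
# Minimal sketch of the pipeline: raw web text -> Llama-3 rephrase -> synthetic corpus.
# Assumptions: the meta-llama/Meta-Llama-3-8B-Instruct checkpoint is available,
# and the prompt and sampling settings below are illustrative, not the project's exact config.
from transformers import pipeline

rephraser = pipeline(
    "text-generation",
    model="meta-llama/Meta-Llama-3-8B-Instruct",  # assumed checkpoint
    device_map="auto",
)

PROMPT = (
    "Rephrase the following passage in clear, well-written English, "
    "keeping every fact intact:\n\n{doc}\n\nRephrased passage:"
)

def rephrase(doc: str, max_new_tokens: int = 512) -> str:
    """Return one rephrased version of a single raw document."""
    out = rephraser(
        PROMPT.format(doc=doc),
        max_new_tokens=max_new_tokens,
        do_sample=True,
        temperature=0.7,
        return_full_text=False,  # keep only the generated continuation
    )
    return out[0]["generated_text"].strip()

def build_synthetic_corpus(raw_docs, out_path="synthetic_corpus.txt"):
    """Write one rephrased line per raw document to the synthetic corpus file."""
    with open(out_path, "w", encoding="utf-8") as f:
        for doc in raw_docs:
            f.write(rephrase(doc).replace("\n", " ") + "\n")
```

Generating several rephrasings per document (for example with different prompts or temperatures) is what lets the synthetic corpus grow larger than the raw web data it started from.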

Rebuild the dataset

Follow the info section below.
