Product Info
Godel Base B‑1:
A 200M-Parameter Sparse Mixture-of-Experts Language Model
At Atom Technologies, we are rethinking how large language models can be scaled efficiently while remaining accessible and performant across diverse applications. Godel Base B‑1 represents the first milestone in this journey: a mid-sized, open-source model designed to balance parameter count, inference speed, and modular architecture.
Model Overview
Godel Base B‑1 is a 200 million parameter transformer language model that integrates a Mixture-of-Experts (MoE) routing layer to deliver greater capacity without linear growth in compute requirements.
This architecture is built from the ground up for:
Efficient inference under constrained hardware budgets
Modular scaling by adding additional experts
Compatibility with modern tokenizers (like our T‑1) and training pipelines
Architecture Details
Godel Base B‑1 combines:
Dense transformer backbone
A standard transformer encoder-decoder stack responsible for contextual representation learning.
Sparse MoE layers
Selectively activated experts controlled by a gating network, allowing the model to dynamically specialize without incurring the full cost of dense layers.
Core Components
1. Embedding Layer
Vocabulary Size: 6,000 (T‑1 tokenizer)
Embedding Dimension: 512
2. Transformer Blocks
24 transformer layers
Multi-head self-attention (8 attention heads per layer)
Pre-layer normalization
Feed-forward hidden dimension: 2048
3. MoE Layer
16 experts in total
Sparse routing: Top-2 expert selection per token
Each expert: independent feed-forward network
Gating network: produces softmax routing probabilities that control expert activation
4. Output Projection
Linear projection to vocabulary logits
Softmax for final token probabilities
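To make these pieces concrete, here is a minimal PyTorch sketch of how the four components could fit together. It follows the dimensions listed above (6,000-token vocabulary, 512-dim embeddings, 24 pre-LN blocks with 8 heads, 2048-dim feed-forward, top-2-of-16 MoE), but the module names, the placement of the MoE layer, and details such as positional encoding and attention masking are illustrative assumptions rather than the released implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Expert(nn.Module):
    """One feed-forward network; also doubles as the dense FFN in non-MoE blocks."""
    def __init__(self, d_model=512, d_ff=2048):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))

    def forward(self, x):
        return self.net(x)

class Top2MoE(nn.Module):
    """Sparse MoE layer: softmax gating, top-2 expert selection per token."""
    def __init__(self, d_model=512, d_ff=2048, n_experts=16, k=2):
        super().__init__()
        self.experts = nn.ModuleList([Expert(d_model, d_ff) for _ in range(n_experts)])
        self.gate = nn.Linear(d_model, n_experts)
        self.k = k

    def forward(self, x):                                   # x: (batch, seq, d_model)
        probs = F.softmax(self.gate(x), dim=-1)             # gating probabilities
        topv, topi = probs.topk(self.k, dim=-1)             # top-2 experts per token
        topv = topv / topv.sum(dim=-1, keepdim=True)        # renormalize the two weights
        out = torch.zeros_like(x)
        for slot in range(self.k):
            idx, w = topi[..., slot], topv[..., slot, None]
            for e, expert in enumerate(self.experts):
                mask = (idx == e)                            # tokens routed to expert e
                if mask.any():
                    out[mask] += w[mask] * expert(x[mask])
        return out

class Block(nn.Module):
    """Pre-LN transformer block; the FFN is either dense or a sparse MoE layer."""
    def __init__(self, d_model=512, n_heads=8, d_ff=2048, moe=False):
        super().__init__()
        self.ln1 = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ln2 = nn.LayerNorm(d_model)
        self.ffn = Top2MoE(d_model, d_ff) if moe else Expert(d_model, d_ff)

    def forward(self, x):                                    # attention masking omitted for brevity
        h = self.ln1(x)
        x = x + self.attn(h, h, h, need_weights=False)[0]
        return x + self.ffn(self.ln2(x))

class GodelB1Sketch(nn.Module):
    def __init__(self, vocab=6000, d_model=512, n_layers=24):
        super().__init__()
        self.embed = nn.Embedding(vocab, d_model)            # positional encoding omitted for brevity
        # Assumption for illustration: MoE replaces the FFN in every fourth block.
        self.blocks = nn.ModuleList([Block(d_model, moe=(i % 4 == 3)) for i in range(n_layers)])
        self.ln_f = nn.LayerNorm(d_model)
        self.head = nn.Linear(d_model, vocab)

    def forward(self, tokens):                               # tokens: (batch, seq)
        x = self.embed(tokens)
        for block in self.blocks:
            x = block(x)
        return self.head(self.ln_f(x))                       # vocabulary logits
```

The loop-over-experts routing above favors readability; a production implementation would batch tokens per expert and enforce a capacity limit.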
Mixture-of-Experts Design
Godel Base B‑1’s MoE implementation is inspired by approaches proven in research (Switch Transformer, DeepSeek-V3):
Sparse Activation: Only 2 out of 16 experts are active for each input token.
Load Balancing Loss: Auxiliary loss term encourages uniform utilization of experts.
Capacity Factor: Adjustable hyperparameter controlling how many tokens each expert may process per batch
This gives the model an effective capacity far larger than a dense transformer with the same per-token compute budget, while keeping memory and compute usage efficient.
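As a concrete illustration, the sketch below shows a Switch-Transformer-style formulation of the load-balancing loss and a typical way the capacity factor translates into a per-expert token budget; the exact formulation and default values used in B‑1 are assumptions here.

```python
import torch

def load_balancing_loss(gate_probs: torch.Tensor, top1_idx: torch.Tensor, n_experts: int = 16) -> torch.Tensor:
    """Auxiliary loss encouraging uniform expert utilization (Switch-Transformer style).

    gate_probs: (n_tokens, n_experts) softmax gating probabilities
    top1_idx:   (n_tokens,) index of each token's highest-probability expert
    """
    # f_i: fraction of tokens whose top choice is expert i
    f = torch.bincount(top1_idx, minlength=n_experts).float() / top1_idx.numel()
    # P_i: mean gate probability mass assigned to expert i
    p = gate_probs.mean(dim=0)
    # Minimized when both routing counts and gate mass are uniform across experts
    return n_experts * torch.sum(f * p)

def expert_capacity(n_tokens: int, n_experts: int = 16, capacity_factor: float = 1.25) -> int:
    # Maximum tokens an expert processes per batch; overflow tokens are dropped or re-routed
    return int(capacity_factor * n_tokens / n_experts)
```

During training the auxiliary term is added to the language-modeling loss with a small weight, so routing stays balanced without dominating the primary objective.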
Training Objectives
The model is pretrained on large-scale English and domain-specific corpora using:
Masked Language Modeling
Next Token Prediction
Auxiliary MoE load balancing regularization
Optimization uses AdamW with a cosine learning rate schedule and mixed-precision training for speed and stability.
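A minimal training-step sketch along those lines is shown below, reusing the GodelB1Sketch module from the earlier architecture sketch and assuming a CUDA device; the learning rate, weight decay, and schedule length are placeholder values, not the published recipe.

```python
import torch
import torch.nn.functional as F

model = GodelB1Sketch().cuda()

# Placeholder hyperparameters for illustration only
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4, weight_decay=0.1)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=100_000)
scaler = torch.cuda.amp.GradScaler()

def training_step(tokens, targets):
    """One mixed-precision next-token-prediction step."""
    optimizer.zero_grad(set_to_none=True)
    with torch.autocast(device_type="cuda", dtype=torch.float16):
        logits = model(tokens)
        # Primary objective: next-token cross-entropy. The MoE load-balancing
        # term (see the earlier sketch) would be added here with a small weight
        # once the gating probabilities are exposed by the model.
        loss = F.cross_entropy(logits.view(-1, logits.size(-1)), targets.view(-1))
    scaler.scale(loss).backward()
    scaler.step(optimizer)
    scaler.update()
    scheduler.step()
    return loss.item()
```

Gradient scaling prevents underflow in float16, and the cosine schedule decays the learning rate smoothly over the course of the run.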
Performance Goals
Godel Base B‑1 is designed to deliver:
Competitive perplexity relative to larger dense models
Faster inference due to sparse expert activation
Robust handling of domain-specific text without specialized fine-tuning
While smaller than flagship multi-billion parameter LLMs, B‑1 prioritizes deployability and adaptability.
Open-Source Roadmap
Godel Base B‑1 will be fully open-sourced, including:
Pretrained weights
Model architecture definitions (PyTorch)
Tokenizer artifacts
Training recipes and hyperparameters
Our goal is to make high-quality foundation models transparent, reproducible, and accessible to the broader ML community.
Use Cases
Potential applications include:
General-purpose text generation and summarization
Task-oriented dialog systems
Code completion and lightweight programming assistance
Research and experimentation with sparse expert models
Conclusion
Godel Base B‑1 takes a new approach to scaling mid-sized language models, combining sparse Mixture-of-Experts routing with a streamlined transformer backbone and an open-source philosophy. This architecture establishes a flexible foundation for future models and extensions, including larger parameter counts and more specialized expert modules.
Stay tuned for model releases, technical papers, and integration guides.



