
Meet Our Customer
The Scientific and Technological Research Council of Türkiye (TÜBİTAK) is Türkiye's leading agency for advancing science, technology, and innovation. Established in 1963, TÜBİTAK's mission is to promote, manage, and fund scientific research while shaping Türkiye's national R&D agenda. An autonomous organization, TÜBİTAK is governed by a Scientific Board composed of distinguished scholars from universities, industry, and research institutions. It plays a pivotal role in fostering a science-driven culture and supporting the country's technological growth.
The Challenge
TÜBİTAK set out to develop the first Turkish-language large language model (LLM). Its ambitious goals included:
- Adapting tokenizers specifically for Turkish.
- Experimenting with data mixing strategies.
- Applying best practices for training models at scale.
- Conducting a proof-of-concept (POC) to assess the model's viability, followed by fine-tuning the LLM for optimal performance.
Vision
TÜBİTAK's ultimate goal is to create a Turkish-first LLM that can serve as a foundation model for a variety of downstream applications. This model would empower future AI initiatives in Türkiye and provide an essential tool for further fine-tuning across different sectors.
Solution
Our collaboration with TÜBİTAK involved a comprehensive approach to building the LLM on AWS infrastructure:
- Tokenizer Training: We developed new tokenizers optimized for Turkish using a large, curated Turkish NLP dataset (a minimal training sketch follows this list).
- Tokenizer Evaluation: We assessed tokenizers using fertility scores (the average number of subword tokens produced per word; lower is better) and iterated to improve efficiency; see the scoring helper after this list.
- Data Preparation & Sharding: TÜBİTAK's data was preprocessed, tokenized, and sharded to maintain quality and maximize training throughput (a sharding sketch appears below the list).
- AWS Pre-Training Code: Leveraging Fully Sharded Data Parallel (FSDP), we developed and validated pre-training code on AWS for optimized scaling; a stripped-down FSDP loop is sketched after this list.
- Data Experiments: We ran experiments incorporating cultural data and analyzed the results to inform later training runs.
- Full Pre-Training: The LLM was pre-trained on a data mix containing 5% English, enabling benchmarking against both Turkish and international standards (a simple mixing sketch follows the list).
- Fine-Tuning: After pre-training, the LLM underwent multiple fine-tuning cycles to improve performance on the Turkish Leaderboard Dataset.
- Model Merging: We experimented with merging model checkpoints and measured performance gains over baseline results; a weight-averaging sketch appears after this list.
- Comprehensive Evaluation: Metrics were captured throughout the process, enabling comparative analysis and identification of the most effective development strategies.
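
To make the tokenizer step concrete, here is a minimal sketch of training a Turkish BPE tokenizer with the Hugging Face tokenizers library. The vocabulary size, special tokens, and corpus file names are illustrative assumptions, not details from the engagement:

```python
from tokenizers import Tokenizer, models, pre_tokenizers, trainers

# Placeholder corpus files; the actual curated Turkish NLP dataset is not public.
corpus_files = ["tr_corpus_part_0.txt", "tr_corpus_part_1.txt"]

tokenizer = Tokenizer(models.BPE(unk_token="[UNK]"))
tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()

trainer = trainers.BpeTrainer(
    vocab_size=32_000,  # assumed size; conveniently fits token IDs in uint16
    special_tokens=["[UNK]", "[PAD]", "[BOS]", "[EOS]"],
)
tokenizer.train(corpus_files, trainer)
tokenizer.save("turkish_bpe_tokenizer.json")
```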
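Fertility measures how many subword tokens a tokenizer emits per word; a Turkish-optimized tokenizer should score lower (closer to 1) on Turkish text than a generic multilingual one. A minimal scoring helper, assuming the tokenizers Encoding API and the `tokenizer` object from the sketch above:

```python
def fertility(tokenizer, texts):
    """Average number of subword tokens per whitespace-delimited word."""
    total_tokens, total_words = 0, 0
    for text in texts:
        total_words += len(text.split())
        total_tokens += len(tokenizer.encode(text).tokens)
    return total_tokens / total_words

# Example: compare candidate tokenizers on a held-out Turkish sample.
sample = ["Kedi bahçede uyuyor.", "Ankara Türkiye'nin başkentidir."]
print(f"fertility: {fertility(tokenizer, sample):.2f}")
```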
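For the preparation step, one common recipe is to tokenize the corpus into fixed-size binary shards that training workers can stream. This sketch assumes the 32K-vocabulary tokenizer above (so token IDs fit in uint16); the shard size and file names are hypothetical:

```python
import numpy as np
from tokenizers import Tokenizer

TOKENS_PER_SHARD = 100_000_000  # assumed shard size

tok = Tokenizer.from_file("turkish_bpe_tokenizer.json")

def shard_corpus(lines, out_prefix="tr_shard"):
    """Tokenize text lines and write fixed-size binary shards of token IDs."""
    buf, shard_id = [], 0
    for line in lines:
        buf.extend(tok.encode(line).ids)
        while len(buf) >= TOKENS_PER_SHARD:
            np.array(buf[:TOKENS_PER_SHARD], dtype=np.uint16).tofile(
                f"{out_prefix}_{shard_id:05d}.bin"
            )
            buf, shard_id = buf[TOKENS_PER_SHARD:], shard_id + 1
```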
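FSDP here presumably refers to PyTorch's FullyShardedDataParallel, which shards parameters, gradients, and optimizer state across GPUs. Below is a stripped-down training loop assuming a torchrun launch; the tiny model and dummy loss are stand-ins for the real transformer and its next-token language-modeling objective:

```python
import torch
import torch.distributed as dist
import torch.nn as nn
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

# Assumes launch via: torchrun --nproc_per_node=<num_gpus> train.py
dist.init_process_group("nccl")
local_rank = dist.get_rank() % torch.cuda.device_count()
torch.cuda.set_device(local_rank)

# Toy stand-in for the actual transformer architecture.
model = nn.Sequential(nn.Linear(1024, 4096), nn.GELU(), nn.Linear(4096, 1024))
model = FSDP(model.to(local_rank))  # shards params, grads, and optimizer state

optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)
for step in range(10):  # stand-in for iterating over the tokenized shards
    x = torch.randn(8, 1024, device=local_rank)
    loss = model(x).pow(2).mean()  # dummy loss; real objective is cross-entropy
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```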
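The 5% English mix can be realized at the data-loader level by sampling shards with fixed probabilities. A minimal sketch; the shard lists and the exact sampling scheme are assumptions, since the case study only states the 5% ratio:

```python
import random

# Hypothetical shard inventories produced by the sharding step above.
turkish_shards = [f"tr_shard_{i:05d}.bin" for i in range(950)]
english_shards = [f"en_shard_{i:05d}.bin" for i in range(50)]

def next_shard(p_english=0.05, rng=random.Random(0)):
    """Sample the next shard so roughly 5% of training tokens are English."""
    pool = english_shards if rng.random() < p_english else turkish_shards
    return rng.choice(pool)
```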
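For the merging experiments, a simple baseline is a linear (weighted-average) merge of checkpoints that share an architecture. The case study does not say which merging method was used, so this is one plausible recipe with hypothetical checkpoint paths:

```python
import torch

def merge_checkpoints(state_dicts, weights=None):
    """Weighted average of model state dicts with identical keys and shapes."""
    n = len(state_dicts)
    weights = weights or [1.0 / n] * n
    return {
        key: sum(w * sd[key].float() for w, sd in zip(weights, state_dicts))
        for key in state_dicts[0]
    }

# Example: equal-weight merge of two fine-tuned checkpoints.
merged = merge_checkpoints([
    torch.load("ft_run_a.pt", map_location="cpu"),
    torch.load("ft_run_b.pt", map_location="cpu"),
])
torch.save(merged, "merged_model.pt")
```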
AWS Services
The project relied heavily on AWS services, including Amazon EC2 P5 instances for training, to ensure scalability and efficiency.
Benefits
TÜBİTAK achieved significant milestones through this initiative:
10x Faster Training: Amazon EC2 P5 instances dramatically reduced training time, enabling rapid iteration and experimentation.
Turkish LLM Leaderboard Success: The fine-tuned model is now challenging the top-performing models on the Turkish LLM Leaderboard, marking a major leap in Turkish NLP capabilities.
Next Steps
TÜBİTAK is continuing its work to refine and fine-tune the LLM, with a vision to create an even more sophisticated Turkish-language model that will push the boundaries of AI and NLP in Türkiye.