How NorthBay Solutions Helped TÜBİTAK Launch a Foundational Turkish LLM on AWS

Meet Our Customer

The Scientific and Technological Research Council of Türkiye (TÜBİTAK) is Türkiye's leading agency for advancing science, technology, and innovation. Established in 1963, TÜBİTAK promotes, manages, and funds scientific research while shaping the national R&D agenda. An autonomous organization, it is governed by a Scientific Board composed of distinguished scholars from universities, industry, and research institutions. It plays a pivotal role in fostering a science-driven culture and supporting the country's technological growth.

The Challenge

TÜBİTAK set out to develop the first Turkish-language large language model (LLM). Its goals included:

  • Adapting tokenizers specifically for Turkish.
  • Experimenting with data mixing strategies.
  • Applying best practices for training models at scale.
  • Conducting a proof-of-concept (POC) to assess the model's viability, followed by fine-tuning the LLM for optimal performance.

Vision

TÜBİTAK's ultimate goal is to create a Turkish-first LLM that can serve as a foundation model for a variety of downstream applications. This model would empower future AI initiatives in Türkiye and provide an essential tool for further fine-tuning across different sectors.

Solution

Our collaboration with TÜBİTAK involved a comprehensive approach to building their LLM using AWS infrastructure:

  • Tokenizer Training: We developed new tokenizers optimized for Turkish, using a large, curated Turkish NLP dataset.
  • Tokenizer Evaluation: We assessed tokenizers through fertility scores and made iterative improvements to enhance efficiency.
  • Data Preparation & Sharding: TÜBİTAK’s data was preprocessed, tokenized, and sharded to maintain quality and maximize training performance.
  • AWS Pre-Training Code: Leveraging Fully Sharded Data Parallel (FSDP) techniques, we developed and validated pre-training code on AWS for optimized scaling.
  • Data Experiments: We ran experiments incorporating cultural data to improve model performance and generated insights based on results.
  • Full Pre-Training: The LLM was pre-trained with a 5% English data mix, allowing for benchmarking against both Turkish and international standards.
  • Fine-Tuning: After pre-training, the LLM underwent multiple fine-tuning cycles to improve performance on the Turkish Leaderboard Dataset.
  • Model Merging: We experimented with merging models and conducted performance assessments to measure improvement over baseline results.
  • Comprehensive Evaluation: Metrics were captured throughout the process, enabling comparative analysis and identification of the most effective development strategies.
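
The tokenizer evaluation step can be illustrated with a small sketch. Fertility is commonly defined as the average number of tokens a tokenizer produces per word, where lower values mean longer sequences fit in the same context window. The case study does not state the exact definition or tooling used, so the whitespace word splitting, the two toy tokenizers, and the `VOCAB` set below are all hypothetical stand-ins chosen for illustration.

```python
from typing import Callable, Iterable, List, Set


def fertility(texts: Iterable[str], tokenize: Callable[[str], List[str]]) -> float:
    """Average number of tokens per whitespace-separated word (assumed definition)."""
    total_words = 0
    total_tokens = 0
    for text in texts:
        total_words += len(text.split())
        total_tokens += len(tokenize(text))
    if total_words == 0:
        raise ValueError("no words to evaluate")
    return total_tokens / total_words


def chunk_tokenizer(text: str, size: int = 3) -> List[str]:
    """Toy baseline: cut every word into fixed-size character chunks."""
    tokens: List[str] = []
    for word in text.split():
        tokens.extend(word[i:i + size] for i in range(0, len(word), size))
    return tokens


def greedy_vocab_tokenizer(text: str, vocab: Set[str]) -> List[str]:
    """Toy subword tokenizer: greedy longest match, single-character fallback."""
    tokens: List[str] = []
    for word in text.split():
        i = 0
        while i < len(word):
            # Try the longest candidate first; a one-character piece always matches.
            for j in range(len(word), i, -1):
                if word[i:j] in vocab or j == i + 1:
                    tokens.append(word[i:j])
                    i = j
                    break
    return tokens


# Hypothetical vocabulary containing a few Turkish stems and suffixes.
VOCAB = {"ev", "ler", "imiz", "de", "kitap", "lar"}
SAMPLE = ["evlerimizde kitaplar"]

baseline = fertility(SAMPLE, chunk_tokenizer)
adapted = fertility(SAMPLE, lambda t: greedy_vocab_tokenizer(t, VOCAB))
print(f"chunk tokenizer fertility:  {baseline}")  # 3.5
print(f"vocab tokenizer fertility:  {adapted}")   # 3.0
```

In this toy case the vocabulary that knows Turkish suffixes yields fewer tokens per word than the fixed-size chunker. In practice the `tokenize` callable would wrap a trained tokenizer and the sample would be a held-out Turkish corpus.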

AWS Services

The project relied heavily on AWS services for scalability and efficiency, including the AWS P5 instances used for training.

Benefits

TÜBİTAK achieved significant milestones through this initiative:

  • 10x Faster Training: AWS P5 instances dramatically reduced training time, enabling rapid iteration and experimentation.
  • Turkish LLM Leaderboard Success: The fine-tuned model is now challenging the top-performing models on the Turkish LLM Leaderboard, marking a major leap in Turkish NLP capabilities.

Next Steps

TÜBİTAK is continuing its work to refine and fine-tune the LLM, with a vision to create an even more sophisticated Turkish-language model that will push the boundaries of AI and NLP in Türkiye.

About NorthBay

NorthBay Solutions is a leading provider of cutting-edge technology solutions, specializing in Generative AI, Cloud Migration, ML/AI, Data Lakes and Analytics, and Managed Services. As an AWS Premier Partner, we leverage the power of the cloud to deliver innovative and scalable solutions to clients across various industries, including Healthcare, Fintech, Logistics, Manufacturing, Retail, and Education.

Our commitment to AWS extends to our partnerships with industry-leading companies like CloudRail-IIOT, RiverMeadow, and Snowflake. These collaborations enable us to offer comprehensive and tailored solutions that integrate with AWS services, providing our clients with the best possible value and flexibility.

With a global footprint spanning NAMER (US & Canada), MEA (Kuwait, Qatar, UAE, KSA & Africa), Türkiye, and APAC (including Indonesia, Singapore, Hong Kong, the Philippines, and Vietnam), NorthBay Solutions is committed to providing exceptional service and support to businesses worldwide.

We hold the following AWS competencies, specializations, and programs:

  • Generative AI and Machine Learning
  • Data & Analytics
  • DevOps
  • Mobile
  • Education
  • Migration Competency
  • MSP Partner
  • Public Sector Partner
  • Database Ready
  • Database Freedom
  • Solution Provider Partner
  • Well Architected