Data & Models

* See our GitHub for an updated list.

  • ARBERT & MARBERT: Deep Bidirectional Transformers for Arabic: [link]
  • MegaCov: A billion-scale dataset for COVID-19 in 100+ languages: [link]
  • DiaLex: A benchmark for evaluating dialectal Arabic word embeddings: [link]
  • NADI 2020 and 2021 shared task data (Arabic dialects): [link]
  • Arabic micro-dialects & models: [link]
  • Arabic manipulated and fake news data & models: [link]
  • English machine-generated data: [link]
  • Yoruba machine translation data: [link]
  • Arabic emotion detection data: [link]
  • Arabic dialect ID benchmarks: [link]
  • Arabic Twitter word embedding models (based on word2vec): [link]