Arabic is one of the six official languages of the United Nations. It is widely spoken throughout the Middle Eastern countries by about 420 million people. Arabic is one of the oldest, largest, most culturally rich and most influential languages in world history. However, it has received far less attention when it comes to natural language processing (NLP).

Arabic is a broad language with many different dialects, and its complexity and diversity make it ambiguous and challenging to understand. Its wide-ranging vocabulary and vocal tones also pose many challenges for computational processing of the language and its data, which is central to NLP algorithms. As a result, collecting and processing Arabic data for training students, teachers, learners and professionals becomes complicated, and this is a significant roadblock to expanding the language across the world.
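One concrete source of this ambiguity is that Arabic is often written without its short-vowel marks (tashkeel), so differently vocalized words collapse into the same written form. The sketch below is only an illustration and is not tied to any dataset in this article: the helper name strip_tashkeel is hypothetical, and it assumes the marks of interest are the eight Unicode code points U+064B to U+0652.

```python
# Arabic short-vowel and related marks (fathatan .. sukun), U+064B..U+0652.
# Assumption: only these eight marks are treated as tashkeel here.
TASHKEEL = {chr(cp) for cp in range(0x064B, 0x0653)}


def strip_tashkeel(text: str) -> str:
    """Return text with the tashkeel marks removed (hypothetical helper)."""
    return "".join(ch for ch in text if ch not in TASHKEEL)


# "kataba" (he wrote): kaf, teh, beh, each followed by a fatha.
kataba = "\u0643\u064E\u062A\u064E\u0628\u064E"
# "kutub" (books): kaf and teh each followed by a damma, then beh.
kutub = "\u0643\u064F\u062A\u064F\u0628"

# Both words reduce to the same three bare letters once the marks are gone.
bare_kataba = strip_tashkeel(kataba)
bare_kutub = strip_tashkeel(kutub)
print(bare_kataba == bare_kutub)
```

Stripping the marks is a common way to make vocalized and unvocalized text comparable, but as the example shows, it also removes the very information that distinguishes the two words, which is part of why Arabic text is hard to process automatically.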

Here is a list of excellent Arabic text datasets, available to the public, that Arabic enthusiasts can use for machine learning.

Best Arabic Text Datasets:

Quranic Arabic Corpus: An annotated linguistic resource that shows the Arabic grammar, morphology and syntax of every single word of the Holy Quran.

Corpus of Contemporary Arabic (CCA): Everyone from language engineers and professionals to language teachers and foreigners learning Arabic can benefit from this corpus, which contains approximately one million annotated Arabic words.

Arabic Learner Corpus (ALC): This student-oriented corpus is best suited to learners, as it consists of approximately 0.2 million Arabic words contributed by 942 students. It contains written and spoken Arabic materials collected from learners of Arabic in Saudi Arabia.

Arabic Poetry Dataset: This dataset is designed explicitly for poetry. It has a massive collection of 58 thousand scraped poems, ranging from the 6th century to the present day. The dataset also includes metadata such as the category of poetry and its name.

