Friday, 31 May 2013

KALIMAT Multipurpose Arabic Corpus


We are pleased to announce the immediate availability of KALIMAT 1.0.

KALIMAT is an Arabic natural language resource that consists of:
1) 20,291 Arabic articles collected from the Omani newspaper Alwatan by Abbas et al. (2011).
2) 20,291 Extractive Single-document system summaries.
3) 2,057 Extractive Multi-document system summaries.
4) 20,291 Named Entity Recognised articles.
5) 20,291 Part of Speech Tagged articles.
6) 20,291 Morphologically Analysed articles.

The data collection articles fall into six categories:
culture, economy, local-news, international-news, religion, and sports.

The process of creating KALIMAT was applied to the entire data collection (20,291 articles).

ARABIC LEARNER CORPUS v1

Contents
The first version of the Arabic Learner Corpus (ALC) comprises a collection of texts written by learners of Arabic in Saudi Arabia. The corpus covers two types of students: non-native Arabic speakers (NNAS) learning Arabic as a second language (ASL) for academic purposes (AAP), and native Arabic-speaking students (NAS) learning to improve their written Arabic. Both groups consist of male students at pre-university level.
The current version of ALC was collected in November and December 2012. It includes a total of 31,272 words in 215 written texts (narrative and discussion) produced by 92 students from 24 nationalities and 26 different L1 backgrounds. 181 texts (84%) were written in class (timed essays), while 34 (16%) were produced at home (untimed essays). The average length of the texts is 145 words. 95% of the texts were hand-written, so they had to be transcribed into a suitable computerised form. All identifying information (e.g. names, contact details, dates of birth) has been removed from the transcriptions.
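The percentages and the average text length quoted above follow directly from the raw counts. A minimal Python check, using only the figures stated in this announcement (the variable names are illustrative):

```python
# Figures as stated in the announcement.
total_words = 31272
total_texts = 215
in_class_texts = 181
at_home_texts = 34

# The two production settings should account for every text.
assert in_class_texts + at_home_texts == total_texts

# Shares of timed (in-class) and untimed (at-home) essays, rounded to whole percent.
in_class_pct = round(100 * in_class_texts / total_texts)
at_home_pct = round(100 * at_home_texts / total_texts)

# Average text length in words, rounded to the nearest word.
average_length = round(total_words / total_texts)

print(in_class_pct, at_home_pct, average_length)  # 84 16 145
```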
File formats
Three types of non-annotated files have been generated: (1) with no header, (2) with a metadata header in Arabic, and (3) with a metadata header in English. They are available in two formats, txt and XML. The metadata enables researchers to identify the characteristics of the text and its producer in each transcription. The original hand-written sheets are also available, scanned and saved as PDF files.
All corpus files are named in a way that indicates the basic characteristics of the text and its author (e.g. S038_T2_M_Pre_NNAS_W_C). The fields are, in order: student identifier number, text number, author gender, level of study, nativeness, text mode, and place of text production.
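The naming scheme can be split into its seven fields mechanically. The following Python sketch is illustrative only and is not part of the ALC distribution: the class name `AlcFileName` and the function `parse_alc_name` are hypothetical, and the sketch assumes that fields are always separated by underscores and that any file extension follows a single dot. It does not decode the individual field values, since the announcement does not list the full set of codes.

```python
from typing import NamedTuple


class AlcFileName(NamedTuple):
    """Fields of an ALC file name, in the order given in the announcement."""
    student_id: str   # e.g. "S038"
    text_number: str  # e.g. "T2"
    gender: str       # e.g. "M"
    level: str        # e.g. "Pre"
    nativeness: str   # e.g. "NNAS"
    mode: str         # e.g. "W"
    place: str        # e.g. "C"


def parse_alc_name(name: str) -> AlcFileName:
    """Split an ALC file name into its seven underscore-separated fields."""
    # Drop a file extension such as ".txt" or ".xml" if one is present.
    stem = name.rsplit(".", 1)[0] if "." in name else name
    parts = stem.split("_")
    if len(parts) != 7:
        raise ValueError(f"expected 7 fields, got {len(parts)}: {name!r}")
    return AlcFileName(*parts)


info = parse_alc_name("S038_T2_M_Pre_NNAS_W_C")
print(info.student_id, info.nativeness, info.place)  # S038 NNAS C
```

Keeping the fields as plain strings avoids guessing at codes the announcement does not document; a lookup table could be added once the full list of values is known.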
Future work
As a next stage, the entire corpus will be annotated for errors and word-tagged with morphological tags to identify part of speech and certain grammatical sub-categories. Additionally, the correct form of each text will be reconstructed by correcting all the mistakes.
Errors will be annotated using a detailed error-type tagset, which has been developed for Arabic learner corpora in general and for use in the present corpus in particular (Alfaifi and Atwell, 2012). In the future, further versions will be released that include more materials (written and spoken), both genders (male and female), and different levels of study (pre-university and university).
References
Alfaifi, A. and Atwell, E. (2012). Arabic Learner Corpora (ALC): A Taxonomy of Coding Errors. In: the 8th International Computing Conference in Arabic (ICCA 2012), 26 - 28 December 2012, Cairo, Egypt.

For more information and to download:
http://www.comp.leeds.ac.uk/scayga/alc/index.html