ARABIC LEARNER CORPUS v1
Contents
The first version of the Arabic Learner Corpus (ALC) is a collection of texts written by learners of Arabic in Saudi Arabia. The corpus covers two types of students: non-native Arabic speakers (NNAS) learning Arabic as a second language (ASL) for academic purposes (AAP), and native Arabic speakers (NAS) learning to improve their written Arabic. Both groups consist of male students at pre-university level.
The current version of the ALC was
collected in November and December 2012. It contains a total of 31,272
words in 215 written texts (narrative and discussion) produced by 92
students of 24 nationalities and 26 different L1 backgrounds. 181
texts (84%) were written in class (timed essays), while 34 (16%) were
produced at home (untimed essays). The average text length is 145
words. 95% of the texts were handwritten, so they had to be transcribed
into a suitable computerised form. All identifying information (e.g.
names, contact details, dates of birth) has been removed from the
transcriptions.
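The reported percentages and average length follow directly from the counts given above. The short Python check below is only an illustrative sketch; the variable names are invented for this example and are not part of the corpus distribution.

```python
# Counts as reported in the corpus description.
total_words = 31272
total_texts = 215
in_class_texts = 181
at_home_texts = 34

# The two production settings should account for every text.
assert in_class_texts + at_home_texts == total_texts

# Shares of texts by place of production, rounded to whole percentages.
share_in_class = round(100 * in_class_texts / total_texts)
share_at_home = round(100 * at_home_texts / total_texts)

# Mean text length in words, rounded to the nearest whole word.
mean_length = round(total_words / total_texts)

print(share_in_class, share_at_home, mean_length)  # prints: 84 16 145
```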
File format
Three types of non-annotated files have been generated: (1) with no header, (2) with a metadata header in Arabic, and (3) with a metadata header in English. They are available in two formats, TXT and XML. The metadata enables researchers to identify the characteristics of each text and its author. The original handwritten sheets are also available as scanned PDF files.
All corpus files are named according to a scheme
that indicates the basic characteristics of the text and its author
(e.g. S038_T2_M_Pre_NNAS_W_C). The fields are, in order: student identifier
number, text number, author gender, level of study, nativeness, text
mode, and place of text production.
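The naming scheme above can be split into its seven fields programmatically. The following Python sketch is an illustration only: the class name ALCFileName, the function parse_alc_name, and the field names are hypothetical, and the assumption that files carry an extension such as .txt or .xml after the name is not confirmed by the description. The codes are kept as raw strings, since only the field order is documented here.

```python
from pathlib import PurePath
from typing import NamedTuple


class ALCFileName(NamedTuple):
    """Fields of an ALC file name, in the documented order (hypothetical type)."""
    student_id: str    # e.g. "S038"
    text_number: str   # e.g. "T2"
    gender: str        # e.g. "M"
    level: str         # e.g. "Pre"
    nativeness: str    # e.g. "NNAS"
    mode: str          # e.g. "W"
    place: str         # e.g. "C"


def parse_alc_name(name: str) -> ALCFileName:
    """Split a file name such as 'S038_T2_M_Pre_NNAS_W_C' into its fields.

    Assumption: any extension (for example '.txt' or '.xml') follows the
    last field and can be dropped before splitting on underscores.
    """
    stem = PurePath(name).stem
    parts = stem.split("_")
    if len(parts) != 7:
        raise ValueError(f"expected 7 underscore-separated fields, got {len(parts)}: {name!r}")
    return ALCFileName(*parts)


info = parse_alc_name("S038_T2_M_Pre_NNAS_W_C")
print(info.student_id, info.nativeness)  # prints: S038 NNAS
```

Keeping the codes as strings avoids guessing at meanings the description does not spell out; a lookup table could be added once the full code list is confirmed against the corpus documentation.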
Future work
As a next stage, the entire corpus will be annotated for errors and tagged word by word with morphological tags that identify part of speech and certain grammatical sub-categories. Additionally, the correct form of each text will be reconstructed by correcting all of the mistakes.
Error annotation will be performed
using a detailed error-type tagset, which has been developed for Arabic
learner corpora in general and for the present corpus in
particular (Alfaifi and Atwell, 2012). In future, further versions will
be issued that include more materials (written and spoken), both
genders (male and female), and different levels of study (pre-university
and university).
References
Alfaifi, A. and Atwell, E. (2012). Arabic Learner Corpora (ALC): A Taxonomy of Coding Errors. In: the 8th International Computing Conference in Arabic (ICCA 2012), 26 - 28 December 2012, Cairo, Egypt.
For more information and to download:
http://www.comp.leeds.ac.uk/scayga/alc/index.html