ArabCeleb

ArabCeleb is an audio dataset collected in the wild that focuses specifically on the Arabic language. It contains 1930 utterances from 100 celebrities, taken from videos on YouTube. The dataset can be used for several speaker recognition tasks: identification, verification, and gender recognition, as well as multimodal recognition tasks that integrate the audio and video tracks.

To allow methods trained for speaker identification to be reused for speaker verification, we generate the development and test sets so that no speaker appears in both. The development set is further divided into training, validation, and test sets for speaker identification.
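The two-level split described above can be sketched as follows. This is only an illustration, not the procedure used to build the released splits: the function names, the (speaker_id, utterance_id) input format, the fractions, and the random seed are all assumptions.

```python
import random
from collections import defaultdict


def split_speakers(utterances, test_fraction=0.2, seed=0):
    """Split (speaker_id, utterance_id) pairs into dev and test sets
    whose speakers do not overlap. test_fraction is an assumed value."""
    speakers = sorted({spk for spk, _ in utterances})
    rng = random.Random(seed)
    rng.shuffle(speakers)
    n_test = max(1, round(len(speakers) * test_fraction))
    test_speakers = set(speakers[:n_test])
    dev = [u for u in utterances if u[0] not in test_speakers]
    test = [u for u in utterances if u[0] in test_speakers]
    return dev, test


def split_dev_for_identification(dev, val_fraction=0.1, test_fraction=0.1, seed=0):
    """Split each dev speaker's utterances into train/val/test, so the
    same speakers appear in all three subsets (closed-set identification).
    Speakers with very few utterances may end up with no training data."""
    by_speaker = defaultdict(list)
    for spk, utt in dev:
        by_speaker[spk].append(utt)
    rng = random.Random(seed)
    train, val, test = [], [], []
    for spk in sorted(by_speaker):
        utts = sorted(by_speaker[spk])
        rng.shuffle(utts)
        n_val = max(1, round(len(utts) * val_fraction))
        n_test = max(1, round(len(utts) * test_fraction))
        val += [(spk, u) for u in utts[:n_val]]
        test += [(spk, u) for u in utts[n_val:n_val + n_test]]
        train += [(spk, u) for u in utts[n_val + n_test:]]
    return train, val, test
```

The first function separates speakers, which is what verification on unseen speakers needs; the second separates utterances within each development speaker, which is what identification needs.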

Dependencies

Python 3.8
pytube 11.0.1
ffmpeg 4.2.4

To run the code, install the packages listed in requirements.txt as follows:

pip install -r requirements.txt

Downloads

We provide YouTube URLs, timestamps for utterances, and speaker metadata.

URLs and timestamps

We provide the URL of each YouTube video and the timestamps of its utterances in the file utterance_info.json.

Audio files

The audio files can be downloaded using the information provided in the file info.json by running the script prepare_dataset.py as follows:

python prepare_dataset.py

The script:

Downloads the video at the given YouTube URL
Cuts the entire video into video sequences
Extracts and saves the audio signal into a wav file
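For readers who want to see what these steps involve, here is a minimal sketch using pytube and the ffmpeg command-line tool. It is not the code of prepare_dataset.py: the function names, the 16 kHz mono 16-bit output format, and the merging of the cut and extraction steps into a single ffmpeg call are assumptions made for illustration.

```python
import subprocess
from pathlib import Path


def download_video(url, out_dir):
    """Download the video at `url` into `out_dir` and return the file path."""
    # Imported lazily so the ffmpeg helpers work without pytube installed.
    from pytube import YouTube
    stream = YouTube(url).streams.get_highest_resolution()
    return stream.download(output_path=str(out_dir))


def build_cut_command(video_path, start, end, wav_path, sample_rate=16000):
    """Build an ffmpeg command that cuts [start, end] (in seconds) from the
    video and writes the audio as a mono 16-bit PCM wav file.
    The sample rate and channel count are assumed values."""
    if end <= start:
        raise ValueError("end must be greater than start")
    return [
        "ffmpeg", "-y",
        "-i", str(video_path),
        "-ss", f"{start:.3f}",        # start of the utterance
        "-t", f"{end - start:.3f}",   # duration of the utterance
        "-vn",                        # drop the video track
        "-ac", "1",                   # mono
        "-ar", str(sample_rate),
        "-acodec", "pcm_s16le",
        str(wav_path),
    ]


def extract_utterance(video_path, start, end, wav_path):
    """Run ffmpeg to save one utterance as a wav file."""
    Path(wav_path).parent.mkdir(parents=True, exist_ok=True)
    subprocess.run(build_cut_command(video_path, start, end, wav_path), check=True)
```

Keeping the command construction separate from its execution makes it easy to inspect or log the exact ffmpeg invocation before processing the whole dataset.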

Metadata

Full names, years of birth, and gender labels for all the speakers in the dataset can be found in speaker_info.csv.
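As a quick sanity check on the metadata, the values of one column can be tallied with the standard library. This sketch assumes that speaker_info.csv has a header row; the column names are not documented here, so the helper takes the column name as a parameter, and any name passed to it (such as "gender" below) is hypothetical.

```python
import csv
from collections import Counter


def count_by_column(csv_path, column):
    """Count how often each value occurs in `column` of a CSV file
    that has a header row."""
    with open(csv_path, newline="", encoding="utf-8") as f:
        reader = csv.DictReader(f)
        if column not in (reader.fieldnames or []):
            raise KeyError(f"column {column!r} not found in {reader.fieldnames}")
        return Counter(row[column] for row in reader)
```

For example, count_by_column("speaker_info.csv", "gender") would return the number of speakers per gender label, provided the file actually uses that column name.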

Dataset split for identification

List of trial pairs for verification

License

The ArabCeleb dataset is available to download for commercial/research purposes under a Creative Commons Attribution 4.0 International License. The copyright remains with the original owners of the video. A complete version of the license can be found here.

Caution: We note that the distribution of identities in the ArabCeleb dataset may not be representative of the global human population. Please be careful of unintended societal, gender, racial, and other biases when training or deploying models trained on this data.

Please contact the authors below if you have any queries regarding the dataset.

Citation

Please cite the following if you make use of the dataset:

Simone Bianco, Luigi Celona, Intissar Khalifa, Paolo Napoletano, Alexey Petrovsky, Flavio Piccoli, Raimondo Schettini, and Ivan Shanin. ArabCeleb: Speaker Recognition in Arabic. In AIxIA 2021 - Advances in Artificial Intelligence, Springer, pp. 338-347, 2022.

@inproceedings{bianco2022arabceleb,
 author = {Bianco, Simone and Celona, Luigi and Khalifa, Intissar and Napoletano, Paolo and Petrovsky, Alexey and Piccoli, Flavio and Schettini, Raimondo and Shanin, Ivan},
 booktitle="AIxIA 2021 -- Advances in Artificial Intelligence",
 year="2022",
 publisher="Springer International Publishing",
 address="Cham",
 pages="338--347",
 title = {ArabCeleb: Speaker Recognition in Arabic},
 isbn="978-3-031-08421-8"
}