This document explains how to access the research collection "BDI-Sen: A Sentence Dataset for Clinical Symptoms of Depression".

Any scientific publication derived from the use of this collection should explicitly refer to the following publication:

Anxo Pérez, Javier Parapar, Álvaro Barreiro, and Silvia López-Larrosa. 2023. BDI-Sen: A Sentence Dataset for Clinical Symptoms of Depression. In Proceedings of the 46th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR ’23), July 23–27, 2023, Taipei, Taiwan. ACM, New York, NY, USA, 11 pages. https://doi.org/10.1145/3539618.3591905

The BDI-Sen collection is available for research purposes under proper user agreements.

Data

Structure of the dataset (relevant and control sentences all come from BDI-Sen). All appended labels are in the following CSV files:

					train-with-severities-and-multilabels.csv
					val-with-severities-and-multilabels.csv
					test-with-severities-and-multilabels.csv

The columns are (Sentence, Label). Each file includes all the symptoms considered from the BDI-II and whether the sentence is relevant to each one.
The Excel files (.xlsx) are the original files used by the annotators. We used them to append all the symptom severities together.
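
As a minimal sketch of reading one of the CSV splits, the snippet below uses only the Python standard library. It assumes each file has a header row with columns named exactly "Sentence" and "Label"; the helper name load_split is hypothetical, and labels are kept as raw strings because their encoding is not specified here. The demo rows are placeholders, not BDI-Sen data.

```python
import csv
import io


def load_split(source):
    """Read (sentence, label) pairs from an open CSV text stream.

    Assumption: the stream has a header row with columns named
    "Sentence" and "Label". Labels are returned as raw strings.
    """
    reader = csv.DictReader(source)
    return [(row["Sentence"], row["Label"]) for row in reader]


# Placeholder rows for illustration only; these are NOT BDI-Sen data.
demo = io.StringIO(
    "Sentence,Label\n"
    "placeholder sentence A,0\n"
    "placeholder sentence B,1\n"
)
pairs = load_split(demo)
print(pairs)
```

For a real split, the same helper could be called on a file opened with open("train-with-severities-and-multilabels.csv", newline="", encoding="utf-8"); the UTF-8 encoding is also an assumption.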

***Important (balance of the labels):***

(a) For symptom detection, we balanced the labels in the train/val splits so that they contain the same number of relevant (1) and non-relevant (0) sentences.
(b) For severity detection, the train/val splits contain fewer non-relevant sentences. Here, we included as many samples from the control class (4) as the maximum number of training samples in any of the remaining classes (0, 1, 2, 3).
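
The two balancing rules above can be sketched as follows. This is an illustration under stated assumptions, not the procedure actually used to build the splits: the text does not say how samples were selected, so random downsampling with a fixed seed is assumed, and the function names balance_binary and cap_control are hypothetical.

```python
import random
from collections import Counter

CONTROL = 4  # control class id in the severity task, as stated in (b)


def balance_binary(samples, seed=0):
    """Rule (a): keep the same number of relevant (1) and non-relevant (0).

    Assumption: the larger class is randomly downsampled.
    """
    rng = random.Random(seed)
    by_label = {0: [], 1: []}
    for sentence, label in samples:
        by_label[label].append((sentence, label))
    n = min(len(by_label[0]), len(by_label[1]))
    balanced = rng.sample(by_label[0], n) + rng.sample(by_label[1], n)
    rng.shuffle(balanced)
    return balanced


def cap_control(samples, control=CONTROL, seed=0):
    """Rule (b): keep as many control samples as the largest other class.

    Assumption: control samples are chosen by random sampling.
    """
    rng = random.Random(seed)
    counts = Counter(label for _, label in samples if label != control)
    cap = max(counts.values(), default=0)
    controls = [s for s in samples if s[1] == control]
    others = [s for s in samples if s[1] != control]
    kept = rng.sample(controls, min(cap, len(controls)))
    return others + kept
```

Both functions take a list of (sentence, label) pairs with integer labels and return a new list, leaving the input untouched.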

The BDI-Sen Dataset

BDI-Sen is a symptom-based dataset of relevant sentences that trace the presence of clinical symptoms. To build it, we developed an annotation schema based on the BDI-II, a highly reliable tool to diagnose depression in clinical settings. The BDI-II covers 21 recognized symptoms, including emotional, cognitive and physical markers. Each item in the questionnaire has four alternative response options scaled in severity from 0 (least severe) to 3 (most severe).

Each option has an associated textual description. Table 1 provides an example of the option descriptions for the symptom Loss of energy. To create the BDI-Sen dataset, we used the eRisk2019 depression severity collection as the data source. It contains social media users’ posts from Reddit and their responses to the BDI-II symptoms.

How to obtain the sentences

This collection can only be used for research purposes. To obtain the full collection that corresponds to our work, please fill in the following user agreement and send it to anxo.pvila@udc.es.