This document provides the details needed to access the research collection "BDI-Sen: A Sentence Dataset for Clinical Symptoms of Depression".
Any scientific publication derived from the use of this collection should explicitly refer to the following publication:
Anxo Pérez, Javier Parapar, Álvaro Barreiro, and Silvia López-Larrosa. 2023. BDI-Sen: A Sentence Dataset for Clinical Symptoms of Depression. In Proceedings of the 46th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR ’23), July 23–27, 2023, Taipei, Taiwan. ACM, New York, NY, USA, 11 pages. https://doi.org/10.1145/3539618.3591905
The BDI-Sen collection is available for research purposes under a proper user agreement.
The dataset is provided in JSONL (JSON Lines) format. The main dataset files are:
• bdi_unified.jsonl - Complete dataset with all symptom annotations per sentence
• bdi_majority_vote.jsonl - Dataset with majority-voted annotations
The dataset is also split into training, validation, and test sets in the splits/ directory:
• train.jsonl / val.jsonl / test.jsonl - Standard splits
• train-with-control.jsonl / val-with-control.jsonl / test-with-control.jsonl - Splits with control sentences
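As a minimal sketch of how the split files listed above might be located and loaded, the Python snippet below builds the file paths from the naming pattern and reads one split into memory. The helper names split_paths and load_split are hypothetical, and the snippet assumes that the splits/ directory sits directly under the folder where the collection was unpacked.

```python
import json
from pathlib import Path
from typing import Dict, List


def split_paths(root: str, with_control: bool = False) -> Dict[str, Path]:
    """Map each split name to its expected file under <root>/splits/.

    Assumption: <root> is the folder where the collection was unpacked.
    """
    suffix = "-with-control" if with_control else ""
    return {
        name: Path(root) / "splits" / f"{name}{suffix}.jsonl"
        for name in ("train", "val", "test")
    }


def load_split(path: Path) -> List[dict]:
    """Read a JSONL file into a list of dicts, skipping blank lines."""
    with path.open(encoding="utf-8") as handle:
        return [json.loads(line) for line in handle if line.strip()]
```

For example, load_split(split_paths(".", with_control=True)["train"]) would read splits/train-with-control.jsonl relative to the current directory, provided the file is present there.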
Each entry in the JSONL files is a JSON object with the following structure:
{
"sentence": "<text>",
"annotations": [
{
"symptom": "<symptom_name>",
"severity": <0|1|2|3|null>,
"label": <0|1>
}
]
}
Fields:
• sentence: The text of the sentence being annotated
• annotations: List of symptom annotations for this sentence
• symptom: The BDI-II symptom category (21 symptoms total)
• severity: Severity level (0=none, 1=mild, 2=moderate, 3=severe, null=unassigned)
• label: Relevance to symptom (0=not relevant, 1=relevant)
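The structure described above can be read with the Python standard library alone. The following is a sketch, not an official loader: the function names read_jsonl, relevant_symptoms, and count_relevant are hypothetical, and the code assumes every entry follows the field layout documented here (a "sentence" string plus a list of "annotations" with "symptom", "severity", and "label").

```python
import json
from collections import Counter
from typing import Iterable, Iterator, List, Optional, Tuple


def read_jsonl(path: str) -> Iterator[dict]:
    """Yield one JSON object per non-empty line of a JSONL file."""
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:
                yield json.loads(line)


def relevant_symptoms(entry: dict) -> List[Tuple[str, Optional[int]]]:
    """Return (symptom, severity) pairs for annotations with label == 1.

    Severity may be None when it is unassigned (null in the file).
    """
    return [
        (ann["symptom"], ann.get("severity"))
        for ann in entry.get("annotations", [])
        if ann.get("label") == 1
    ]


def count_relevant(entries: Iterable[dict]) -> Counter:
    """Count how many sentences are marked relevant for each symptom."""
    counts: Counter = Counter()
    for entry in entries:
        for symptom, _severity in relevant_symptoms(entry):
            counts[symptom] += 1
    return counts
```

Calling count_relevant(read_jsonl("bdi_majority_vote.jsonl")) would then give a per-symptom count of relevant sentences, assuming the file is in the working directory.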
BDI-Sen is a symptom-based dataset with relevant sentences that trace the presence of clinical symptoms. For this reason, we developed an annotation schema based on the BDI-II, a highly reliable tool to diagnose depression in clinical settings. The BDI-II covers 21 recognized symptoms, including emotional, cognitive and physical markers. Each item in the questionnaire has four response options scaled in severity from 0 (least severe) to 3 (most severe).
Each option has an associated textual description. Table 1 provides an example of the option descriptions for the symptom Loss of energy. To create the BDI-Sen dataset, we used the eRisk2019 depression severity collection as the data source; it contains social media users’ publications from Reddit and their responses to the BDI-II symptoms.
This collection can only be used for research purposes. To obtain the full collection that corresponds to our work, please fill in the following user agreement and send it to anxo.pvila@udc.es.