This document provides the details needed to access the research collection "BDI-Sen: A Sentence Dataset for Clinical Symptoms of Depression".

Any scientific publication derived from the use of this collection should explicitly refer to the following publication:

Anxo Pérez, Javier Parapar, Álvaro Barreiro, and Silvia López-Larrosa. 2023. BDI-Sen: A Sentence Dataset for Clinical Symptoms of Depression. In Proceedings of the 46th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR ’23), July 23–27, 2023, Taipei, Taiwan. ACM, New York, NY, USA, 11 pages. https://doi.org/10.1145/3539618.3591905

The BDI-Sen collection is available for research purposes under proper user agreements.

Data

The dataset is provided in JSONL (JSON Lines) format. The main dataset files are:

  • bdi_unified.jsonl - Complete dataset with all symptom annotations per sentence
  • bdi_majority_vote.jsonl - Dataset with majority-voted annotations
				

The dataset is also split into training, validation, and test sets in the splits/ directory:

  • train.jsonl / val.jsonl / test.jsonl - Standard splits
  • train-with-control.jsonl / val-with-control.jsonl / test-with-control.jsonl - Splits with control sentences
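
The file naming above can be turned into a small loader. The following is a minimal sketch, not part of the official distribution: the helper names split_path and load_split are hypothetical, and it assumes that the splits/ directory sits directly under a local dataset root. The demo writes a made-up record to a temporary directory so that it runs without the real files.

```python
import json
import tempfile
from pathlib import Path

SPLITS = ("train", "val", "test")

def split_path(root, split, with_control=False):
    """Build the path of a split file (hypothetical helper, assumes <root>/splits/)."""
    if split not in SPLITS:
        raise ValueError(f"unknown split: {split!r}")
    suffix = "-with-control" if with_control else ""
    return Path(root) / "splits" / f"{split}{suffix}.jsonl"

def load_split(root, split, with_control=False):
    """Read one split into a list of dicts, one per non-empty JSONL line."""
    with split_path(root, split, with_control).open(encoding="utf-8") as handle:
        return [json.loads(line) for line in handle if line.strip()]

# Demo with a made-up record in a temporary directory (not real dataset content).
with tempfile.TemporaryDirectory() as root:
    path = split_path(root, "val")
    path.parent.mkdir(parents=True)
    toy = {"sentence": "Toy sentence.", "annotations": []}
    path.write_text(json.dumps(toy) + "\n", encoding="utf-8")
    demo = load_split(root, "val")

print(len(demo), "record(s) loaded")
```

With the real collection, the call would look like load_split("path/to/BDI-Sen", "train", with_control=True), where the root path is whatever location the files were unpacked to.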
				

Dataset Format

Each entry in the JSONL files is a JSON object with the following structure:

{
  "sentence": "<text>",
  "annotations": [
    {
      "symptom": "<symptom_name>",
      "severity": <0|1|2|3|null>,
      "label": <0|1>
    }
  ]
}

Fields:
sentence: The text of the sentence being annotated
annotations: List of symptom annotations for this sentence
  • symptom: The BDI-II symptom category (21 symptoms total)
  • severity: Severity level (0=none, 1=mild, 2=moderate, 3=severe, null=unassigned)
  • label: Relevance to symptom (0=not relevant, 1=relevant)
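
As an illustration of the structure above, the following minimal sketch parses JSONL records and keeps the annotations marked as relevant. The two sample records are made up for this example and are not taken from the dataset; the symptom string "Loss of energy" is assumed to match the naming used in the files, and the helper names read_jsonl and relevant_symptoms are hypothetical.

```python
import io
import json

# Two made-up records in the documented format (not real dataset content).
SAMPLE_JSONL = """\
{"sentence": "I have no energy to do anything.", "annotations": [{"symptom": "Loss of energy", "severity": 2, "label": 1}]}
{"sentence": "The weather was nice today.", "annotations": [{"symptom": "Loss of energy", "severity": null, "label": 0}]}
"""

def read_jsonl(handle):
    """Yield one dict per non-empty line of a JSON Lines stream."""
    for line in handle:
        line = line.strip()
        if line:
            yield json.loads(line)

def relevant_symptoms(record):
    """Return (symptom, severity) pairs for annotations with label == 1."""
    return [
        (ann["symptom"], ann["severity"])
        for ann in record["annotations"]
        if ann["label"] == 1
    ]

records = list(read_jsonl(io.StringIO(SAMPLE_JSONL)))
for record in records:
    print(record["sentence"], relevant_symptoms(record))
```

To read an actual file, replace the io.StringIO object with open("bdi_unified.jsonl", encoding="utf-8"). A JSON null severity is returned by the json module as Python None.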


Important (balance of the labels):

(a) In the case of symptom detection, we balanced the labels in the train/val splits so that they contain the same number of relevant (1) and non-relevant (0) sentences.
(b) In the case of severity detection, the train/val splits contain fewer non-relevant sentences, so we included control sentences to balance the dataset alongside the severity-labeled samples.
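
One way to inspect the balance described above is to count relevant and non-relevant annotations per symptom. This is a minimal sketch under assumptions: the function name label_counts is hypothetical, and the toy records and symptom strings are placeholders rather than actual dataset content.

```python
from collections import Counter

def label_counts(records):
    """Count label values (1 = relevant, 0 = not relevant) per symptom."""
    counts = {}
    for record in records:
        for ann in record["annotations"]:
            counts.setdefault(ann["symptom"], Counter())[ann["label"]] += 1
    return counts

# Made-up records in the documented format (placeholders, not dataset content).
toy_records = [
    {"sentence": "Toy sentence A.",
     "annotations": [{"symptom": "Loss of energy", "severity": 1, "label": 1}]},
    {"sentence": "Toy sentence B.",
     "annotations": [{"symptom": "Loss of energy", "severity": None, "label": 0}]},
    {"sentence": "Toy sentence C.",
     "annotations": [{"symptom": "Sadness", "severity": 3, "label": 1}]},
]

counts = label_counts(toy_records)
for symptom, counter in counts.items():
    print(symptom, "relevant:", counter[1], "non-relevant:", counter[0])
```

Running the same function over the records of a train or val split shows how many relevant and non-relevant annotations each symptom has in that split.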

The BDI-Sen Dataset

BDI-Sen is a symptom-based dataset of relevant sentences that trace the presence of clinical symptoms. To build it, we developed an annotation schema based on the BDI-II, a highly reliable tool to diagnose depression in clinical settings. The BDI-II covers 21 recognized symptoms, including emotional, cognitive and physical markers. Each item in the questionnaire has four alternative responses graded in severity from 0 (least severe) to 3 (most severe).

Each of these options has an associated textual description. Table 1 provides an example of the option descriptions for the symptom Loss of energy. To create the BDI-Sen dataset, we used the eRisk2019 depression severity collection as the data source. It contains social media users’ publications from Reddit and their responses to the BDI-II symptoms.

How to obtain the sentences

This collection can only be used for research purposes. To obtain the full collection that corresponds to our work, please fill in the following user agreement and send it to anxo.pvila@udc.es.