The aim of this project is to investigate whether financial hardship is consistently associated with elevated mental health risk signals across heterogeneous open datasets. We integrate (mash-up) three Kaggle datasets representing students, professionals, and a broader population by harmonizing key variables (e.g., age group, dietary habits, family history, and a unified financial-hardship axis derived from financial stress or income). Because depression labels are not available in the general dataset, we use “history of mental illness” as a proxy indicator and compare trend directions rather than absolute prevalence across sources. The project emphasizes reproducibility, responsible data reuse, and transparent documentation of provenance, licensing, and ethical considerations, and it publishes processed outputs and visualizations via a one-page website with machine-readable metadata (DCAT-AP).
All three source datasets are obtained from Kaggle, which means their quality depends on how the original authors collected and curated the data. We therefore treat them as secondary, observational datasets and focus on transparency and robustness rather than “ground-truth” prevalence. We assess quality along five dimensions:
(1) provenance (the datasets come from different contributors and sampling frames, so cross-dataset comparisons are limited)
(2) schema/consistency (we normalise Yes/No fields to 0/1 and harmonise key variables such as age groups and diet categories)
(3) missingness and coverage (some combinations of variables are absent, e.g., missing Medium bucket in the professional dataset after filtering)
(4) measurement validity (financial stress vs income measure different constructs; the general dataset lacks a depression label so we use “history of mental illness” as a proxy and compare trend directions only)
(5) stability of estimates (we publish only aggregated outputs and apply a minimum group-size threshold, reducing noise from small groups)
These limitations are documented in the metadata and guide how we interpret results (directional trends rather than causal claims).
This project uses three public Kaggle CSV datasets representing students, professionals, and a broader population sample. Data are downloaded and read via scripts (e.g., using kagglehub) while keeping the raw files unchanged to ensure reproducibility. Because datasets differ in sampling frames and schemas, explicit mapping and harmonisation are required before comparison.
The three datasets are not collected from the same individuals and do not share a common unique identifier. Therefore, a row-level join would create artificial, invalid person-to-person matches. Instead, we apply a group-level mash-up: we compute comparable group statistics within each dataset and then combine the aggregated results for comparison.
Field names and representations vary across datasets. We harmonise them into a shared schema, including:
- Age → age_group (binned categories)
- Dietary Habits → diet_group (Healthy/Moderate/Unhealthy)
- Family mental health history → family_history_flag (Yes/No → 1/0)
- Outcome variables:
  - student/professional: depression_flag (Depression Yes/No → 1/0)
  - general: no depression label; we use proxy_flag (History of Mental Illness → 1/0) as a proxy risk signal
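The mapping above can be sketched for a single raw row as follows. This is a minimal illustration, not the project's actual script: the raw column names (Dietary Habits, Family History, Depression, History of Mental Illness) and the helper names are assumptions and should be checked against each CSV header.

```python
from typing import Optional

# Assumed raw column names; verify against each CSV header before use.
DIET_COLUMN = "Dietary Habits"
FAMILY_HISTORY_COLUMN = "Family History"
# Per source: (assumed raw outcome column, harmonised flag name).
OUTCOME_COLUMN = {
    "student": ("Depression", "depression_flag"),
    "professional": ("Depression", "depression_flag"),
    "general": ("History of Mental Illness", "proxy_flag"),
}
DIET_GROUPS = {"healthy": "Healthy", "moderate": "Moderate", "unhealthy": "Unhealthy"}
TRUE_VALUES = {"yes", "y", "1", "1.0", "true"}
FALSE_VALUES = {"no", "n", "0", "0.0", "false"}


def yes_no_to_flag(value) -> Optional[int]:
    """Map Yes/No-like values to 1/0; return None when the value is unrecognised."""
    text = str(value).strip().lower()
    if text in TRUE_VALUES:
        return 1
    if text in FALSE_VALUES:
        return 0
    return None


def harmonise_row(row: dict, source: str) -> dict:
    """Translate one raw record into the shared schema for the given source."""
    raw_outcome, flag_name = OUTCOME_COLUMN[source]
    diet = str(row.get(DIET_COLUMN, "")).strip().lower()
    return {
        "source": source,
        "diet_group": DIET_GROUPS.get(diet),  # None if the category is unknown
        "family_history_flag": yes_no_to_flag(row.get(FAMILY_HISTORY_COLUMN)),
        flag_name: yes_no_to_flag(row.get(raw_outcome)),
    }
```

Unrecognised values become None rather than being guessed, so they can be dropped later together with other missing key dimensions.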
We perform preprocessing to improve computability and reduce noise:
- Boolean normalisation: Yes/No and similar values mapped to 1/0.
- Category normalisation: unify spelling/casing.
- Binning: convert continuous values into interpretable buckets:
  - age → age_group
  - financial stress (1–5) → Low/Medium/High (student/professional)
  - income → terciles Low/Medium/High (general)
- Missingness & small groups: drop records with missing key dimensions and apply a minimum group size threshold (e.g., MIN_N=30) to reduce unstable estimates and potential misinterpretation.
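The binning steps can be sketched with the Python standard library. The age edges and the stress cut points (1–2 Low, 3 Medium, 4–5 High) are assumptions chosen for illustration, since the text does not state them; terciles are computed here with statistics.quantiles, which may differ slightly from other tercile implementations at the boundaries.

```python
import bisect
import statistics

# Hypothetical age bin edges and labels; the project's real bins may differ.
AGE_EDGES = [25, 35, 50]
AGE_LABELS = ["<25", "25-34", "35-49", "50+"]


def age_group(age: float) -> str:
    """Return the label of the age bin that contains `age`."""
    return AGE_LABELS[bisect.bisect_right(AGE_EDGES, age)]


def stress_bucket(level: float) -> str:
    """Bucket a 1-5 financial stress rating (assumed cut points: <=2, 3, >=4)."""
    if level <= 2:
        return "Low"
    if level <= 3:
        return "Medium"
    return "High"


def income_terciles(incomes: list) -> list:
    """Label each income Low/Medium/High using the two tercile cut points."""
    low_cut, high_cut = statistics.quantiles(incomes, n=3)
    return [
        "Low" if x <= low_cut else "Medium" if x <= high_cut else "High"
        for x in incomes
    ]
```

Note that the income labels produced here still follow the income direction (High = high income); they are not yet a hardship measure.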
A key technical challenge is that “High” does not mean the same thing across datasets:
- student/professional: High = high financial stress → worse conditions
- general: High = high income → better conditions
To avoid semantic mismatch, we define a unified hardship_bucket (Low=better, High=worse):
- student/professional: hardship_bucket = stress_bucket
- general: invert income buckets (low income → high hardship, high income → low hardship)
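A minimal sketch of this alignment, assuming each record already carries a Low/Medium/High bucket (stress for student/professional, income for general). The function and constant names are illustrative.

```python
# Income runs opposite to hardship, so the general dataset's buckets are flipped.
INCOME_TO_HARDSHIP = {"Low": "High", "Medium": "Medium", "High": "Low"}
STRESS_SOURCES = {"student", "professional"}


def hardship_bucket(source: str, bucket: str) -> str:
    """Return the unified hardship bucket (Low = better, High = worse)."""
    if source in STRESS_SOURCES:
        return bucket  # stress bucket already points in the hardship direction
    if source == "general":
        return INCOME_TO_HARDSHIP[bucket]
    raise ValueError(f"unknown source: {source!r}")
```

Raising on an unknown source makes it harder to add a fourth dataset without deciding explicitly which direction its financial variable runs in.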
Within each dataset, we compute group statistics using shared grouping keys (e.g., age_group + diet_group + hardship_bucket + family_history_flag):
- n: group sample size
- rate: risk prevalence (mean of a 0/1 flag)
This produces mashup_summary.csv, and a publication-ready version mashup_summary_public.csv after filtering small/invalid groups.
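The group statistics and the small-group filter can be sketched as below. It assumes harmonised rows are dictionaries and that the outcome (depression_flag or proxy_flag) has been copied into a single hypothetical field named risk_flag; including source in the grouping keys keeps each dataset's groups separate, so records are never joined across datasets.

```python
from collections import defaultdict

MIN_N = 30  # example threshold from the text
KEYS = ("source", "age_group", "diet_group", "hardship_bucket", "family_history_flag")


def summarise(rows, flag="risk_flag", keys=KEYS):
    """Compute n and rate (mean of a 0/1 flag) for every combination of keys."""
    totals = defaultdict(lambda: [0, 0])  # group key -> [n, positives]
    for row in rows:
        values = tuple(row.get(k) for k in keys)
        if any(v is None for v in values) or row.get(flag) is None:
            continue  # drop records with missing key dimensions or outcome
        totals[values][0] += 1
        totals[values][1] += row[flag]
    return [
        dict(zip(keys, group), n=n, rate=positives / n)
        for group, (n, positives) in sorted(totals.items())
    ]


def publishable(summary, min_n=MIN_N):
    """Keep only groups that meet the minimum group-size threshold."""
    return [group for group in summary if group["n"] >= min_n]
```

In this sketch, the output of summarise corresponds to the full summary and the output of publishable to the public version.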
Because the general dataset uses a proxy outcome, we do not compare absolute prevalence across sources. Instead, we compare directional trends along the hardship axis (Low→Medium→High) and stratify by family_history_flag (0/1). We output trend plots to assess cross-source robustness.
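One way to make the directional comparison explicit is sketched below. It pools group rows into one n-weighted rate per hardship bucket and labels the sequence as increasing, decreasing, or mixed. The labelling rule is an assumption for illustration, not the project's stated method, and it tolerates a missing bucket (for example, no Medium bucket in one source).

```python
ORDER = ("Low", "Medium", "High")


def pooled_rates(groups):
    """Pool group rows (with n and rate) into one weighted rate per hardship bucket."""
    n_total, positives = {}, {}
    for g in groups:
        bucket = g["hardship_bucket"]
        n_total[bucket] = n_total.get(bucket, 0) + g["n"]
        positives[bucket] = positives.get(bucket, 0.0) + g["rate"] * g["n"]
    return {b: positives[b] / n_total[b] for b in ORDER if b in n_total}


def trend_direction(rates):
    """Classify the direction of rates along Low -> Medium -> High."""
    values = [rates[b] for b in ORDER if b in rates]
    if len(values) < 2:
        return "insufficient"
    diffs = [later - earlier for earlier, later in zip(values, values[1:])]
    if all(d > 0 for d in diffs):
        return "increasing"
    if all(d < 0 for d in diffs):
        return "decreasing"
    return "mixed"
```

Calling these per source and per family_history_flag stratum yields direction labels that can be compared across datasets without comparing the rates themselves.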
We publish processed aggregates and visualisations:
- mashup_summary_public.csv (publishable aggregate)
- hardship trend plots (stratified by family history)
We export mashup_summary_public.csv as RDF (rdf/mashup_summary_public.ttl), where each row is modelled as a qb:Observation with stable hashed URIs. We also provide DCAT-AP-style metadata (rdf/dcat-ap.ttl) with direct-download distribution links, and a license declaration (rdf/license.ttl). All RDF artefacts are generated by scripts/make_rdf.py.
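A stable hashed URI can be derived from the grouping keys alone, so that re-running the pipeline yields the same URI for the same group even if n or rate change. The sketch below illustrates the idea only; the base URI, the ex: property names, and the hash length are placeholders, and the actual scripts/make_rdf.py may work differently.

```python
import hashlib

BASE = "https://example.org/mashup/"  # placeholder base URI
KEYS = ("source", "age_group", "diet_group", "hardship_bucket", "family_history_flag")
PREFIXES = (
    "@prefix qb: <http://purl.org/linked-data/cube#> .\n"
    f"@prefix ex: <{BASE}def#> .\n"
)


def observation_uri(row, keys=KEYS):
    """Hash the grouping keys only, so the URI does not depend on n or rate."""
    canonical = "|".join(f"{k}={row[k]}" for k in keys)
    digest = hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:16]
    return f"{BASE}obs/{digest}"


def observation_turtle(row):
    """Serialise one summary row as a qb:Observation in Turtle (no literal escaping)."""
    return (
        f"<{observation_uri(row)}> a qb:Observation ;\n"
        f'    ex:source "{row["source"]}" ;\n'
        f'    ex:ageGroup "{row["age_group"]}" ;\n'
        f'    ex:dietGroup "{row["diet_group"]}" ;\n'
        f'    ex:hardshipBucket "{row["hardship_bucket"]}" ;\n'
        f'    ex:familyHistoryFlag {row["family_history_flag"]} ;\n'
        f'    ex:n {row["n"]} ;\n'
        f'    ex:rate {row["rate"]:.4f} .\n'
    )
```

Writing PREFIXES followed by one observation_turtle block per row gives a small Turtle document; a full export would also link each observation to its qb:DataSet.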
Dataset Nature & Ownership: The project uses three open datasets on depression from Kaggle: (1) the General Depression Dataset by Anthony Therrien (413,768 synthetic individual records), (2) the Depression Professional Dataset by user “ikynahidwin” (demographics and mental health of working professionals), and (3) the Student Depression Dataset by Adil Shamim (27,902 student records). Each dataset is provided by its Kaggle contributor, who either collected or synthesized the data. For example, the general dataset is explicitly labeled as “synthetic”, meaning it was artificially generated to mirror realistic patterns without using real personal records. The student and professional datasets appear to be compiled from self-reported surveys (covering factors such as gender, lifestyle, and stress levels). The licenses differ: the General Depression Dataset is CC BY 4.0, the Depression Professional Dataset is CC0 1.0, and the Student Depression Dataset is Apache 2.0. All three allow reuse, but with different requirements: CC BY 4.0 requires attribution, CC0 has no attribution requirement, and Apache 2.0, although written mainly for software, is permissive as long as license notices are preserved. Our GitHub site will clearly list licenses, attributions, and source links for each dataset to ensure compliance and transparency.
Licensing & Compliance: The datasets are released under different open licenses: the General Depression Dataset under CC BY 4.0, the Depression Professional Dataset under CC0 1.0 (Public Domain), and the Student Depression Dataset under the Apache 2.0 License. All three allow reuse for our educational purposes, with varying attribution requirements. Under CC BY 4.0, attribution must be given. Apache 2.0 allows reuse and modification under specific conditions, including retaining the license text and notices. We will document these licenses clearly in our metadata and web portal. Because the data are openly licensed and we document licensing and provenance, our mash-up follows open-data best practices. We will include metadata (DCAT-AP) on our project site detailing each dataset’s origin, license, and any transformations applied, to uphold the requirement of transparent provenance and responsible data reuse.
Data Quality & Accuracy: In evaluating dataset quality, we note that the general depression dataset is synthetic. This has the advantage that no real individuals are represented (avoiding privacy issues), but it also means any insights drawn reflect the assumptions of the data generator rather than actual epidemiological statistics. We must be cautious in interpreting patterns from synthetic data: while it facilitates analysis (no missing values and controlled distributions), it may not capture real-world prevalence or correlations. The professional and student datasets appear to be based on real survey responses (e.g. one includes factors such as “work pressure, job satisfaction, sleep duration, dietary habits, financial stress, work hours” and mental health indicators like depression status and suicidal thoughts). These are presumably self-reported measures, so they can contain biases or inaccuracies (e.g. underreporting of sensitive information, or subjective assessments of depression). There is also a lack of documentation about the sampling method: it is unclear whether these surveys targeted a specific region or were a convenience sample. This uncertainty affects data reliability and representativeness. We will address this by treating our findings as exploratory rather than definitive. Moreover, the outcome is not recorded in the same way across sources: the general dataset does not have a direct depression label, so we use history of mental illness as a proxy outcome (a binary indicator) in that source. We recognize this is not equivalent to a clinical depression diagnosis, so in our cross-dataset comparison we will focus on trend directions rather than raw prevalence rates, to avoid misinterpreting the absolute differences that arise from different definitions and data collection processes.
Privacy and GDPR Compliance: Each dataset has been reviewed for personal or sensitive information. No direct identifiers (such as names, contact details, or student IDs) are present; the data consist of anonymized attributes like age range, gender, lifestyle habits, and yes/no mental health indicators. Given the sensitive nature of mental health information, we nevertheless handle all inputs with a privacy-aware, precautionary approach aligned with Recital 26 of the GDPR, which states that data rendered anonymous in such a way that individuals are no longer identifiable are not subject to the GDPR. The synthetic dataset contains no real person’s information, so it raises no privacy concern of this kind. The student and professional datasets appear to represent real individuals’ responses, but they have been de-identified (no names or contact details), and we publish results from them only in aggregate form. We treat them as anonymous statistical data used for research purposes. Importantly, we will not attempt any re-identification of individuals, and we combine datasets only on common factors (like categorical variables) without any key that could trace back to a person.
That said, we remain mindful that some combinations of demographics could pose a re-identification risk if the dataset were very granular (the so-called de-anonymization problem). For instance, if a record had highly unusual attributes (e.g. a 90-year-old student with certain rare conditions), one could potentially guess that person’s identity. In our case, the student and professional datasets mostly contain general attributes, and the student dataset in particular is large (~28k entries), so we consider the risk low. Nonetheless, to mitigate any de-anonymization risk, we generalize or bin continuous variables (like converting precise ages into age groups), and we can remove or mask sensitive features if necessary. At present the fields we use are coarse enough (e.g. a financial stress level rather than exact financial details) that we do not expect individuals to be identifiable. We also note that health-related data (such as depression status) is a special category of personal data under the GDPR, requiring stricter protection and a legal basis for processing. Our use, however, concerns anonymized data for statistical research. We support compliance by limiting our analysis to group-level insights and not publishing any information that could be tied back to an individual. In summary, we found no privacy issue in the datasets as provided; if a privacy gap is discovered, we will apply additional anonymization to avoid GDPR issues or ethical concerns about personal data misuse.
Avoiding Bias & Discrimination: In analyzing these datasets, we are conscious of potential biases in the data and in our interpretations. The goal is to draw insights about financial hardship and mental health without reinforcing stereotypes or unfair bias. We will examine the composition of each dataset: for example, the student dataset might over-represent a particular country or academic context, and the professional dataset might be skewed towards certain industries or age groups. Such sampling biases can lead to conclusions that do not generalize to other populations. To mitigate this, we will explicitly acknowledge the demographic makeup of each dataset and avoid sweeping generalizations. Any findings will be contextualized (e.g. “within this student sample, financial stress correlates with higher self-reported depression” rather than implying this is universally true for all students). We also take care not to inadvertently discriminate or stigmatize. For instance, if one dataset shows a higher depression rate in a certain gender or income group, we will interpret this carefully, focusing on systemic or contextual factors rather than attributing any inherent trait to that group. Our analysis aims to be fair and human-centered, consistent with EU ethical AI guidelines that call for avoiding unfair bias and prejudice in data-driven insights. We focus on within-dataset patterns, not population claims. We treat the general dataset’s outcome only as a proxy and compare trend direction, not absolute prevalence. We publish only aggregated outputs and apply MIN_N thresholds and binning (age groups, hardship levels). If we detect that an algorithm or visualization might paint a particular group in a negative light, we will reconsider our approach (for example, we might choose not to highlight a trivial difference that could be misconstrued as a stereotype).
Cognitive Bias and Interpretation: We are also wary of cognitive biases on the part of the researchers (ourselves). Given our hypothesis that financial hardship is linked to mental health risk, there is a risk of confirmation bias, that is, selectively seeing what fits our hypothesis. To guard against this, we will employ a transparent methodology and remain open to findings that contradict our expectations. For example, if one dataset shows no strong link between income and depression, we will report that objectively and explore plausible reasons, rather than bending the narrative. Another pitfall to avoid is mistaking correlation for causation: even if we find that those reporting financial stress have higher depression rates, we will not imply that poverty causes depression in a simple one-way manner. There could be other variables at play (family history, support systems, etc.), and the relationship is likely complex and bidirectional. Ethically, we must avoid simplistic explanations that could lead to blame or stigma: implying that someone is depressed because they are financially irresponsible, for example, would be an unethical and unfounded conclusion. Instead, we focus on the structural observation that economic hardship and mental health challenges often co-occur, highlighting the need for support mechanisms, not blame.
Privacy and Dignity: Ethically, handling mental health data demands respecting the dignity and privacy of the individuals behind the data. Even though our datasets don’t include names, we remember that each row could represent a person’s sensitive experience (e.g. having had suicidal thoughts or a history of mental illness). We will ensure that our reporting remains empathetic and non-sensational. All results will be presented in aggregate form – for instance, we might say “X% of students with high financial stress also reported depressive symptoms” – and we will not single out any individual case. By aggregating and anonymizing, we both comply with privacy norms and uphold ethical standards, treating data subjects not as mere data points but as individuals whose information deserves protection. We also avoid any attempt to re-link or triangulate data to find out who they might be (which would be unethical and against data usage agreements).
Preventing Prejudice and Stigma: A key ethical consideration in mental health data is to avoid stigmatization. We acknowledge that depression and financial hardship are sensitive topics, often entangled with social prejudice. In our analysis and the way we communicate results, we will use respectful language and avoid terms that carry stigma. For example, we speak of “individuals experiencing depression” rather than labeling them pejoratively. We aim for the project website and visualizations to educate and inform without perpetuating myths about mental illness. If we compare different groups (students vs professionals, etc.), it will be to understand context, not to cast value judgments. Our ethical review also considers diversity and inclusion: the datasets record “gender” as a binary attribute (likely male/female), which does not account for people who are non-binary or transgender. This is a bias in how the data were collected. We cannot change the source data, but we will document the limitation and avoid statements about gender that assume everyone is either male or female. Cultural factors call for similar care: they can differ widely from one ethnic or cultural group to another, and our analysis does not account for them, so we will not claim that our findings apply to all ethnic or cultural groups.
Transparency and Accountability: We aim to be open about what we do. We will document every step we take to clean and merge the data, including decisions that can affect fairness. For example, we will explain how we harmonised the financial hardship indicator so that it means the same thing across all datasets. If a processing step could introduce bias, such as excluding values or handling missing information, we will disclose it so that others can check our work and judge whether it is fair and ethical. We will also share our code under an open license, in line with open-science practice, so that others can reproduce our work and point out any mistakes. Our website will carry a note stating that this work is for educational purposes and is not a medical opinion about depression; we do not give medical advice or make predictions about specific individuals. In summary, the project respects privacy, avoids bias and discrimination, and handles data responsibly, in line with the principles of fairness and accountability in data ethics.
By proactively addressing these issues – from data collection through to interpretation and publication – we aim to produce insights that are not only interesting and reproducible but also ethical and respectful towards the individuals and communities the data represent.
Our source datasets come from Kaggle, so updates are outside our control and the data may change or be removed over time. To keep the project maintainable, we designed a reproducible pipeline: whenever the Kaggle sources are updated (or before a future re-release), we can re-download the datasets using the same slugs and re-run our scripts to regenerate the processed outputs (mashup_summary_public.csv, trend_summary.csv), visualisations, and the RDF artefacts (mashup_summary_public.ttl, dcat-ap.ttl, license.ttl). Each run can be documented with a date/tag so results remain traceable even if upstream data changes later. If a source dataset changes schema or license, we will update the mapping/metadata accordingly and, if needed, restrict publication to aggregated outputs only.
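Documenting each run with a date/tag could look like the sketch below, which records a tag, a UTC timestamp, and a SHA-256 checksum for each regenerated output. The manifest layout and function names are assumptions for illustration, not part of the existing pipeline.

```python
import hashlib
import json
from datetime import datetime, timezone
from pathlib import Path


def build_manifest(paths, tag):
    """Describe one pipeline run: tag, UTC timestamp, and checksum per output file."""
    files = {}
    for path in map(Path, paths):
        files[path.name] = hashlib.sha256(path.read_bytes()).hexdigest()
    return {
        "tag": tag,
        "generated_at": datetime.now(timezone.utc).isoformat(),
        "files": files,
    }


def write_manifest(manifest, out_path):
    """Store the manifest as JSON next to the outputs it describes."""
    Path(out_path).write_text(
        json.dumps(manifest, indent=2, sort_keys=True), encoding="utf-8"
    )
```

Comparing the checksums of two manifests shows at a glance whether an upstream change on Kaggle altered any published output.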