Current research projects
This page gives an overview of the chair's current research projects. Working papers are available on request.
Small area prediction of counts under machine learning-type mixed models
Frink, N.; Schmid, T.
Abstract: This paper proposes small area estimation methods that utilize generalized tree-based machine learning techniques to improve the estimation of disaggregated means in small areas using discrete survey data. Specifically, we present two approaches based on random forests: the Generalized Mixed Effects Random Forest (GMERF) and a Mixed Effects Random Forest (MERF), both tailored to address challenges associated with count outcomes, particularly overdispersion. Our analysis reveals that the MERF, which does not assume a Poisson distribution to model the mean behavior of count data, excels in scenarios of severe overdispersion. Conversely, the GMERF performs best under conditions where Poisson distribution assumptions are moderately met. Additionally, we introduce and evaluate three bootstrap methodologies - one parametric and two non-parametric - designed to assess the reliability of point estimators for area-level means. The effectiveness of these methodologies is tested through model-based (and design-based) simulations and applied to a real-world dataset from the state of Guerrero in Mexico, demonstrating their robustness and potential for practical applications.
Small area estimation with generalized random forests: Estimating poverty rates in Mexico
Frink, N.; Schmid, T.
Abstract: Identifying and addressing poverty is challenging in administrative units with limited information on income distribution and well-being. To overcome this obstacle, small area estimation methods have been developed to provide reliable and efficient estimators at disaggregated levels, enabling informed decision-making by policymakers despite the data scarcity. From a theoretical perspective, we propose a robust and flexible approach for estimating poverty indicators based on binary response variables within the small area estimation context: the generalized mixed effects random forest. Our method employs machine learning techniques to identify predictive, non-linear relationships from data, while also modeling hierarchical structures. Mean squared error estimation is explored using a parametric bootstrap. From an applied perspective, we examine the impact of information loss due to converting continuous variables into binary variables on the performance of small area estimation methods. We evaluate the proposed point and uncertainty estimates in both model- and design-based simulations. Finally, we apply our method to a case study revealing spatial patterns of poverty in the Mexican state of Tlaxcala.
- Estimation of the consumer price index with regional weights using small area estimation methods: A case study of Germany
Lee, Y.
Abstract: The consumer price index (CPI) is an important indicator for formulating effective economic policies. Most countries produce a national CPI, while some also publish a sub-national (regional) CPI. In the latter case, states sometimes use national product weights, which do not adequately represent the importance of products at the regional level. An ideal regional CPI uses regional product weights to accurately reflect regional specifics. In this study, I explore the estimation of a regional CPI with regional weights by using an income and consumption survey from Germany. To obtain reliable regional weights, I focus on the estimation of regional expenditures for each product. Estimating regional expenditures is challenging because of small sample sizes, which potentially produce unreliable estimates. I address this problem by using small area estimation models, and I show how a model-based estimation of regional expenditures substantially improves the reliability of estimation in Germany.
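The effect of replacing national product weights with regional ones can be illustrated with a minimal sketch. It assumes a Laspeyres-type index, in which price relatives are averaged with expenditure shares as weights; the abstract does not state which index formula the paper uses. The product groups, expenditures, and price relatives are hypothetical.

```python
def expenditure_shares(expenditures):
    """Turn expenditures per product group into weights that sum to one."""
    total = sum(expenditures.values())
    return {product: value / total for product, value in expenditures.items()}


def weighted_price_index(price_relatives, weights):
    """Laspeyres-type index (base = 100): weighted mean of price relatives."""
    return 100.0 * sum(weights[p] * price_relatives[p] for p in weights)


# Hypothetical figures, for illustration only.
national_expenditures = {"food": 300.0, "housing": 500.0, "transport": 200.0}
regional_expenditures = {"food": 250.0, "housing": 600.0, "transport": 150.0}
regional_price_relatives = {"food": 1.05, "housing": 1.10, "transport": 1.02}

index_national_weights = weighted_price_index(
    regional_price_relatives, expenditure_shares(national_expenditures)
)
index_regional_weights = weighted_price_index(
    regional_price_relatives, expenditure_shares(regional_expenditures)
)
print(round(index_national_weights, 2), round(index_regional_weights, 2))
```

In this toy example the region spends a larger share on housing, the group with the strongest price increase, so the index with regional weights is higher than the one with national weights. In the paper's setting, the regional expenditures themselves would come from small area models rather than being observed directly.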
- Estimating Disaggregated Mobility Indicators using Area-Level Models on Survey Data
Mühlbauer, M.
Abstract: Statistical indicators are traditionally estimated from survey data using direct estimation methods. Large-scale mobility surveys, such as Mobilität in Deutschland 2017 (MiD) (Mobility in Germany 2017), are often designed to produce reliable statistical indicators at the level of specific subpopulations defined by geographical regions (e.g., states, districts, counties) or other relevant criteria, such as demographic characteristics (e.g., age, gender). After the data have been collected, there is often a need for indicators at a lower geographical level, for which direct estimation may not provide a satisfactory degree of precision due to insufficient sample sizes. Area-level Small Area Estimation (SAE) models exploit correlations between the dependent variable and auxiliary data to provide sufficiently precise estimates at the desired disaggregated level of interest. This paper demonstrates the application of SAE in the context of mobility research using a (transformed) area-level Fay-Herriot model to estimate district-level transport activity measured in Mean Trip Kilometers (MTK) using MiD data. It answers the question of how reliable disaggregated estimates of MTK and potentially other metric mobility indicators can be estimated using this methodology. Covariates are obtained by aggregating data from MiD and the extensive Infas 360 CASA dataset. In a second step, they are selected using a multi-stage procedure incorporating the LASSO regularization technique. The variances of the direct estimates play an important role in the Fay-Herriot model and are estimated using a bootstrap which is calibrated on a set of MiD totals. Two distributional assumptions, the Gaussianity of the random effects and the residuals, are made in the modeling process. As the data do not fulfill these assumptions, a logarithmic transformation is applied, which provides a better fit to normality.
The results show a significant spread of the mean trip length over the German districts. The densely populated districts of North Rhine-Westphalia have the shortest average trips, while districts in the northeastern states of Germany are characterized by significantly longer trips.
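The role that the variances of the direct estimates play in a Fay-Herriot model can be seen in a minimal sketch of the basic (untransformed) predictor. It is not the paper's implementation: the random-effect variance `sigma_u2` is treated as known here, whereas in practice it is estimated, and the logarithmic transformation with its back-transformation is omitted. All data values and the function name are hypothetical.

```python
import numpy as np


def fay_herriot_predictor(direct, sampling_var, X, sigma_u2):
    """Basic Fay-Herriot predictor for a given random-effect variance.

    direct       : direct estimates per area, shape (m,)
    sampling_var : sampling variances of the direct estimates, shape (m,)
    X            : area-level covariates including intercept, shape (m, p)
    sigma_u2     : variance of the area random effects (assumed known here)
    """
    direct = np.asarray(direct, dtype=float)
    sampling_var = np.asarray(sampling_var, dtype=float)
    X = np.asarray(X, dtype=float)

    # Weighted least squares for the regression coefficients,
    # with weights 1 / (sigma_u2 + sampling variance).
    w = 1.0 / (sigma_u2 + sampling_var)
    XtW = X.T * w
    beta = np.linalg.solve(XtW @ X, XtW @ direct)
    synthetic = X @ beta

    # Shrinkage factor: close to 1 for precise direct estimates,
    # close to 0 for noisy ones.
    gamma = sigma_u2 / (sigma_u2 + sampling_var)
    return gamma * direct + (1.0 - gamma) * synthetic, gamma


# Hypothetical data for four areas.
direct = np.array([10.0, 12.0, 9.0, 15.0])
D = np.array([0.5, 4.0, 1.0, 8.0])
X = np.column_stack([np.ones(4), [1.0, 2.0, 0.5, 3.0]])

estimates, gamma = fay_herriot_predictor(direct, D, X, sigma_u2=2.0)
print(gamma)
print(estimates)
```

Areas with a large sampling variance receive a small shrinkage factor, so their estimates lean on the regression-based synthetic part, while areas with precise direct estimates stay close to them. This is why the quality of the variance estimates matters for the final results.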
The estimation of poverty indicators using mixed effects random forests: case study for the Mexican state of Veracruz
Krennmair, P.; Schmid, T.; Tzavidis, N.
Abstract: Mapping and analysing the spatial concentration of poverty is imperative for evidence-based policies to translate into inclusive and sustainable actions. The use of national sample surveys to obtain detailed and reliable estimates for poverty indicators on disaggregated geographical and other domains (e.g. demographic groups) poses a methodological challenge. Small Area Estimation is a collective term for (model-based) procedures, which combine survey data with existing auxiliary information (e.g. census or administrative data) using predictive models to estimate domain-specific statistical indicators. We propose the use of mixed effects random forests as a flexible, robust, and reliable method to produce domain-specific cumulative distribution functions from which (non-linear) poverty estimators can be obtained. This paper is driven by our aim to inform a transparent and steady discussion on current methodological improvements for Small Area Estimation, such as the use of (tree-based) machine learning methods and their contribution to recent requirements for poverty assessment. We evaluate proposed point and uncertainty estimators in a design-based simulation and focus on a case study uncovering spatial patterns of poverty for the Mexican state of Veracruz.
Analysing opportunity cost of care work using mixed effects random forests under aggregated census data
Krennmair, P.; Würz, N.; Schmid, T.
Abstract: Reliable estimators of the spatial distribution of socio-economic indicators are essential for evidence-based policy-making. As sample sizes are small for highly disaggregated domains, the accuracy of the direct estimates is reduced. Small area estimation approaches are a promising way to overcome this problem. In this work we propose a small area methodology using machine learning methods. The semi-parametric framework of mixed effects random forests combines the advantages of random forests (robustness against outliers and implicit model selection) with the ability to model hierarchical dependencies. Existing random forest-based methods require access to auxiliary information at the population level. We present a methodology that deals with the lack of population micro-data. Our strategy adaptively incorporates aggregated auxiliary information through calibration weights, based on empirical likelihood, for the estimation of area-level means. In addition to our point estimator, we provide a non-parametric bootstrap estimator measuring its uncertainty. The performance of the proposed point estimator and its uncertainty measure is studied in model-based simulations. Finally, the proposed methodology is applied to the 2011 Socio-Economic Panel and aggregate census information from the same year to estimate the average opportunity cost of care work for 96 regional planning regions in Germany.
The R package saeTrafo for estimating unit-level small area models under transformations
Würz, N.
Abstract: The R package saeTrafo provides new statistical methodology for the estimation of small area means using unit-level models under transformations. The method of Würz et al. (2022, JRSSA) enables the use of unit-level models when auxiliary data are limited (often the only source of data due to confidentiality agreements) and the dependent variable, such as income, is skewed (by using transformations such as the log or data-driven log-shift). In addition to the implementation of the new methodology, saeTrafo provides established methods for unit-level models under transformations, allowing further applications and comparisons. Conveniently, the most suitable method is selected automatically and uncertainty estimates are readily available. In addition, tools for creating plots (model validation and estimator evaluation), visualisation on maps and exporting to Excel and OpenDocument Spreadsheets are provided. The functionalities of the package are demonstrated with exemplary data based on Austrian income and living conditions.
- Releasing Survey Microdata with Exact Cluster Locations and Additional Privacy Safeguards
Koebe, T.; Arias-Salazar, A.; Schmid, T.
Abstract: Household survey programs around the world publish fine-granular georeferenced microdata to support research on the interdependence of human livelihoods and their surrounding environment. To safeguard the respondents’ privacy, micro-level survey data is usually (pseudo)-anonymised through deletion or perturbation procedures such as obfuscating the true location of data collection. This, however, poses a challenge to emerging approaches that augment survey data with auxiliary information on a local level. Here, we propose an alternative microdata dissemination strategy that leverages the utility of the original microdata with additional privacy safeguards through synthetically generated data using generative models. We back our proposal with experiments using data from the 2011 Costa Rican census and satellite-derived auxiliary information. Our strategy reduces the respondents’ re-identification risk for any number of disclosed attributes by 60-80% even under re-identification attempts.
- A framework for producing small area estimates based on area-level models in R
Harmening, S.; Kreutzmann, A.-K.; Pannier, S.; Salvati, N.; Schmid, T.
Abstract: The R package emdi facilitates the estimation of regionally disaggregated indicators using small area estimation methods and provides tools for model building, diagnostics, presenting, and exporting the results. Version 1.1.7 of the package includes unit-level small area models that rely on access to micro data, which may be challenging due to confidentiality constraints. In contrast, area-level models are less demanding with respect to (a) data requirements, as only aggregates are needed for estimating regional indicators, and (b) computational resources, and enable the incorporation of design-based properties. Therefore, the area-level model (Fay and Herriot 1979) and various extensions have been added to version 2.0.2 of the package emdi. These extensions include, amongst others, (a) transformed area-level models with back-transformations, (b) spatial and robust extensions, (c) adjusted variance estimation methods, and (d) area-level models that account for measurement errors. Corresponding mean squared error estimators are implemented for assessing the uncertainty. User-friendly tools such as a stepwise variable selection function, model diagnostics, benchmarking options, high-quality maps and export options for the results allow the user to carry out a complete analysis, from model building to diagnostics. The functionality of the package is demonstrated by illustrative examples based on synthetic data for Austrian districts.
- Estimating regional unemployment with mobile network data for functional urban areas in Germany
Hadam, S.; Würz, N.; Kreutzmann, A.-K.; Schmid, T.
Abstract: The ongoing growth of cities due to better job opportunities is leading to increased labour-related commuter flows in several countries. On the one hand, an increasing number of people commute and move to the cities, but on the other hand, the labour market indicates higher unemployment rates in urban areas than in the surrounding areas. We investigate this phenomenon at the regional level using an alternative definition of unemployment rates in which commuting behaviour is integrated. We combine data from the labour force survey with dynamic mobile network data using small area models for the federal state of North Rhine-Westphalia in Germany. From a methodological perspective, we use a transformed Fay-Herriot model with bias correction for the estimation of unemployment rates and propose a parametric bootstrap for the mean squared error estimation that includes the bias correction. The performance of the proposed methodology is evaluated in a case study based on official data and in model-based simulations. The results in the application show that unemployment rates (adjusted by commuters) in German cities are lower than traditional official unemployment rates indicate.
- Scale estimation and data-driven tuning constant selection for M-quantile regression
Dwaber, J.; Salvati, N.; Schmid, T.; Tzavidis, N.
Abstract: M-quantile regression is a general form of quantile-like regression which usually utilises the Huber influence function and a corresponding tuning constant. Estimation requires a nuisance scale parameter to ensure the M-quantile estimates are scale invariant, and several scale estimators have previously been proposed. In this paper we assess these scale estimators and evaluate their suitability, and we propose a new scale estimator based on the method of moments. Further, we present two approaches for data-driven tuning constant selection in M-quantile regression. The tuning constants are obtained by i) minimising the estimated asymptotic variance of the regression parameters and ii) utilising an inverse M-quantile function to reduce the effect of outlying observations. We investigate whether data-driven tuning constants, as opposed to the usual fixed constant, for instance c=1.345, can improve the efficiency of the estimators of M-quantile regression parameters. The performance of the data-driven tuning constant is investigated in different scenarios using model-based simulations. Finally, we illustrate the proposed methods using a European Union Statistics on Income and Living Conditions data set.
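To make the roles of the tuning constant and the scale parameter concrete, the following minimal sketch shows the Huber influence function with c=1.345 and its asymmetric M-quantile version applied to scaled residuals. The median absolute deviation is used as the scale estimator purely as a common illustrative choice; it is an assumption here and not the method-of-moments estimator proposed in the paper. The function names and residual values are hypothetical.

```python
import numpy as np


def huber_psi(u, c=1.345):
    """Huber influence function: identity on [-c, c], constant outside."""
    return np.clip(u, -c, c)


def mquantile_psi(u, q=0.5, c=1.345):
    """Asymmetric influence function used in M-quantile regression.

    Positive scaled residuals are weighted by q, the others by 1 - q.
    """
    u = np.asarray(u, dtype=float)
    weight = np.where(u > 0, q, 1.0 - q)
    return 2.0 * weight * huber_psi(u, c)


def mad_scale(residuals):
    """Median absolute deviation, rescaled for consistency under normality."""
    residuals = np.asarray(residuals, dtype=float)
    return np.median(np.abs(residuals - np.median(residuals))) / 0.6745


# Hypothetical residuals, for illustration only.
residuals = np.array([-3.0, -0.5, 0.0, 0.5, 3.0])
scaled = residuals / mad_scale(residuals)

psi_median = mquantile_psi(scaled, q=0.5)
psi_upper = mquantile_psi(scaled, q=0.75)
print(psi_median)
print(psi_upper)
```

At q=0.5 the function reduces to the ordinary Huber influence function. A smaller c bounds the influence of large residuals more aggressively, which gains robustness at the cost of efficiency; this trade-off is what a data-driven choice of c aims to balance.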
- Asymptotic distribution of regression quantiles in a mixed effects model
Hensel, S.; Pannier, S.; Schmid, T.; Tzavidis, N.
Abstract: Linear quantile models allow for a robust analysis of the conditional distribution of the variable of interest. The introduction of a random effects term extended their range of application to data with complex dependency structures, as they occur in many studies. This paper contributes to a deeper theoretical understanding of linear quantile mixed models by analysing the asymptotic behaviour of the corresponding maximum likelihood estimator. We prove that the estimator is consistent and show that it is asymptotically normally distributed. Additionally, a plug-in variance estimator is derived, and its finite sample behaviour is demonstrated in a simulation study.