Function to describe the distribution of a discrete variable from complex survey data.
It produces a list containing a table, including the confidence intervals of the indicators, a ready-to-be published ggplot graphic and, if proportions for H0 are specified, a Chi-Square statistical test (using survey::svygofchisq()). The confidence intervals and the statistical test are taking into account the complex survey design. In case of facets, no statistical test is (yet) computed.
Exporting those results to an Excell file is possible.
Usage
distrib_discrete(
data,
quali_var,
facet = NULL,
filter_exp = NULL,
...,
na.rm.facet = TRUE,
na.rm.var = TRUE,
probs = NULL,
prop_method = "beta",
reorder = FALSE,
show_ci = TRUE,
show_n = FALSE,
show_value = TRUE,
show_labs = TRUE,
scale = 100,
digits = 0,
unit = "%",
dec = NULL,
col = "sienna2",
pal = NULL,
dodge = 0.9,
font = "Roboto",
wrap_width_y = 25,
title = NULL,
subtitle = NULL,
xlab = NULL,
ylab = NULL,
caption = NULL,
lang = "fr",
theme = "fonctionr",
coef_font = 1,
export_path = NULL
)
distrib_d(...)Arguments
- data
A dataframe or an object from the survey package or an object from the srvyr package.
- quali_var
The discrete variable to be described.
- facet
A variable defining the faceting group.
- filter_exp
An expression filtering the data, preserving the design. Notice that
filter_expworks assrvyr::filter(): it excludes observations for whichfilter_expresults intoNA.It is often the case whenNAis present on one of the filter variables.- ...
All options possible in
srvyr::as_survey_design().- na.rm.facet
TRUEif you want to remove observations withNAon the facet variable.FALSEif you want to create a facet with theNAvalues for the facet variable. Default isTRUE.- na.rm.var
TRUEif you want to remove observations withNAon the discrete variable.FALSEif you want to create a modality withNAvalues for the discrete variable. Default isTRUE.- probs
Vector of probabilities for H0 of the statistical test, in the correct order (will be rescaled to sum to 1). If
probs = NULL, no statistical test is performed. Default isNULL.- prop_method
Type of proportion method used to compute confidence intervals. See
survey::svyciprop()for details. Default is beta method.- reorder
TRUEif you want to reorder the groups according to the proportion.NAvalue, ifna.rm.var = FALSE, is not included in the reorder. In case of facets, the categories are reordered based on each median category Default isFALSE.- show_ci
TRUEif you want to show the error bars on the graphic.FALSEif you don't want to show the error bars. Default isTRUE.- show_n
TRUEif you want to show on the graphic the number of observations in the sample in each category.FALSEif you don't want to show this number. Default isFALSE.- show_value
TRUEif you want to show the proportions in each category on the graphic.FALSEif you don't want to show the proportion. Default isTRUE.- show_labs
TRUEif you want to show axes labels.FALSEif you do not want to show any label on axes. Default isTRUE.- scale
Denominator of the proportion. Default is
100to interprets numbers as percentages.- digits
Number of decimal places displayed on the values labels on the graphic. Default is
0.- unit
Unit displayed on the graphic. Default is
"%".- dec
Decimal mark displayed on the graphic. Default depends on lang:
","for fr and nl ;"."for en.- col
Color of the bars. col must be a R color or an hexadecimal color code. Default is
"sienna2". The color ofNAcategory (in case ofna.rm.var = FALSE) is always"grey".- pal
Argument kept for compatibility with old versions.
- dodge
Width of the bars. Default is
0.9to let a small space between bars. A value of1leads to no space betweens bars. Values higher than1are not advised because they cause an overlaping of the bars.- font
Font used in the graphic. See
load_and_active_fonts()for available fonts. Default is"Roboto".- wrap_width_y
Number of characters before going to the line for the labels of the categories. Default is
25.- title
Title of the graphic.
- subtitle
Subtitle of the graphic.
- xlab
X label on the graphic. As
ggplot2::coord_flip()is used in the graphic,xlabrefers to the x label on the graphic, after theggplot2::coord_flip(), and not to the x variable in the data. Default (xlab = NULL) displays "Distribution (total : 100 pourcent)" (iflang = "fr"), "Distribution (total: 100 percent)" (iflang = "en") or "Distributie (totaal : 100 procent)" (iflang = "nl"). To show no X label, usexlab = "".- ylab
Y label on the graphic. As
ggplot2::coord_flip()is used in the graphic,ylabrefers to the Y label on the graphic, after theggplot2::coord_flip(), and not to the Y variable in the data. Default (ylab = NULL) displays the name of the discrete variable (quali_var). To show no Y label, useylab = "".- caption
Caption of the graphic. This caption goes under de default caption showing the result of the statistical test (if any).
- lang
Language of the indications on the graphic. Possibilities are
"fr"(french),"nl"(dutch) and"en"(english). Default is"fr".- theme
Theme of the graphic. Default is
"fonctionr"."IWEPS"adds y axis lines and ticks.NULLuses the default grey ggplot2 theme.- coef_font
A multiplier factor for font size of all fonts on the graphic. Default is
1. Usefull when exporting the graphic for a publication (e.g. in a Quarto document).- export_path
Path to export the results in an xlsx file. The file includes two or three sheets : the table, the graphic and the statistical test (if
probsis notNULL).
Examples
# Loading of data
data(eusilc, package = "laeken")
# Recoding eusilc$pl030 into eusilc$pl030_rec
eusilc$pl030_rec <- NA
eusilc$pl030_rec[eusilc$pl030 == "1"] <- "Working full time"
eusilc$pl030_rec[eusilc$pl030 == "2"] <- "Working part time"
eusilc$pl030_rec[eusilc$pl030 == "3"] <- "Unemployed"
eusilc$pl030_rec[eusilc$pl030 == "4"] <- "Student"
eusilc$pl030_rec[eusilc$pl030 == "5"] <- "Retired"
eusilc$pl030_rec[eusilc$pl030 == "6"] <- "Permanently disabled"
eusilc$pl030_rec[eusilc$pl030 == "7"] <- "Fulfilling domestic tasks"
# Computation, taking sample design into account
eusilc_dist_group_d <- distrib_d(
eusilc,
pl030_rec,
strata = db040,
ids = db030,
weight = rb050,
title = "Distribution of socio-economic status",
subtitle = "Example with austrian SILC data from 'laeken' package"
)
#> Input: data.frame
#> Sampling design -> ids: db030, strata: db040, weights: rb050
#> Numbers of observation(s) removed by each filter (one after the other):
#> 2720 observation(s) removed due to missing quali_var
# Results in graph form
eusilc_dist_group_d$graph
# Results in table format
eusilc_dist_group_d$tab
#> # A tibble: 7 × 8
#> pl030_rec prop prop_low prop_upp n_sample n_weighted n_weighted_low
#> <fct> <dbl> <dbl> <dbl> <int> <dbl> <dbl>
#> 1 Fulfilling domest… 0.0948 0.0899 0.0998 1207 640311. 605978.
#> 2 Permanently disab… 0.0155 0.0129 0.0186 178 104930. 85796.
#> 3 Retired 0.267 0.258 0.277 3146 1806954. 1746273.
#> 4 Student 0.0586 0.0544 0.0630 736 395829. 365532.
#> 5 Unemployed 0.0449 0.0411 0.0489 518 303252. 276953.
#> 6 Working full time 0.425 0.416 0.434 5162 2869868. 2797833.
#> 7 Working part time 0.0941 0.0890 0.0995 1160 636121. 600709.
#> # ℹ 1 more variable: n_weighted_upp <dbl>