Title: | Miscellaneous Descriptive Functions |
---|---|
Description: | Helper functions for descriptive tasks such as making print-friendly bivariate tables, sample size flow counts, and visualizing sample distributions. Also contains 'R' approximations of some common 'SAS' and 'Stata' functions such as 'PROC MEANS' from 'SAS' and 'ladder', 'gladder', and 'pwcorr' from 'Stata'. |
Authors: | Craig McGowan [aut, cre] |
Maintainer: | Craig McGowan <[email protected]> |
License: | GPL-3 |
Version: | 0.1.0 |
Built: | 2024-10-26 05:42:11 UTC |
Source: | https://github.com/craigjmcgowan/describedata |
Descriptive statistics for categorical variables as well as normally and non-normally distributed continuous variables, split across levels of a categorical variable. Depending on the variable type, an appropriate statistical test is used to assess differences across levels of the comparison variable.
bivariate_compare(df, compare, normal_vars = NULL, non_normal_vars = NULL, cat_vars = NULL, display_round = 2, p = TRUE, p_round = 4, include_na = FALSE, col_n = TRUE, cont_n = FALSE, all_cont_mean = FALSE, all_cont_median = FALSE, iqr = TRUE, fisher = FALSE, workspace = NULL, var_order = NULL, var_label_df = NULL)
df | A data.frame or tibble. |
compare | Discrete variable. Separate statistics will be produced for each level, with statistical tests across levels. Must be quoted. |
normal_vars | Character vector of normally distributed continuous variables that will be included in the descriptive table. |
non_normal_vars | Character vector of non-normally distributed continuous variables that will be included in the descriptive table. |
cat_vars | Character vector of categorical variables that will be included in the descriptive table. |
display_round | Number of decimal places displayed values should be rounded to. |
p | Logical. Should p-values be calculated and displayed? Default TRUE. |
p_round | Number of decimal places p-values should be rounded to. |
include_na | Logical. Should |
col_n | Logical. Should the total number of observations be displayed for each column? Default TRUE. |
cont_n | Logical. Display sample n for continuous variables in the table. Default FALSE. |
all_cont_mean | Logical. Display mean (sd) for all continuous variables. Default FALSE. |
all_cont_median | Logical. Display median (sd) for all continuous variables. Default FALSE. |
iqr | Logical. If the median is displayed for a continuous variable, should interquartile range be displayed as well ( |
fisher | Logical. Should Fisher's exact test be used for categorical variables? Default FALSE. |
workspace | Numeric variable indicating the workspace to be used for Fisher's exact test. If |
var_order | Character vector listing the variable names in the order results should be displayed. If |
var_label_df | A data.frame or tibble with columns "variable" and "label" that contains display labels for each variable specified in |
Statistical differences between normally distributed continuous variables are assessed using aov(), differences in non-normally distributed variables are assessed using kruskal.test(), and differences in categorical variables are assessed using chisq.test() by default, with a user option for fisher.test() instead.
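The four tests named above are base R functions, so the same comparisons can be run directly. The following is a minimal sketch on the built-in iris and mtcars data; the variable choices are illustrative assumptions, and it does not reproduce how bivariate_compare() assembles its table.

```r
# Normally distributed continuous variable: one-way ANOVA via aov()
anova_fit <- aov(Sepal.Length ~ Species, data = iris)
anova_p <- summary(anova_fit)[[1]][["Pr(>F)"]][1]

# Non-normally distributed continuous variable: Kruskal-Wallis test
kw_p <- kruskal.test(mpg ~ cyl, data = mtcars)$p.value

# Categorical variable: chi-squared test on the cross-tabulation
# (small expected counts trigger a warning, which is suppressed here)
tab <- table(mtcars$cyl, mtcars$am)
chisq_p <- suppressWarnings(chisq.test(tab)$p.value)

# Fisher's exact test as the alternative for sparse tables
fisher_p <- fisher.test(tab)$p.value

round(c(anova = anova_p, kruskal = kw_p, chisq = chisq_p, fisher = fisher_p), 4)
```

Fisher's exact test is usually the safer choice when expected cell counts are small, which is the situation the chi-squared warning flags.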
A data.frame with columns label, overall, a column for each level of compare, and p.value. For normal_vars, mean (SD) is displayed, for non_normal_vars median (IQR) is displayed, and for cat_vars n (percent) is displayed. For p values on continuous variables, a superscript 'a' denotes the Kruskal-Wallis test was used.
bivariate_compare(iris, compare = "Species", normal_vars = c("Sepal.Length", "Sepal.Width"))
bivariate_compare(mtcars, compare = "cyl", non_normal_vars = "mpg")
Internal function to calculate pairwise correlations and return p values
cor.prob(df)
df | A data frame or tibble. |
A data.frame with columns h_var, v_var, and p.value
The helpR package contains descriptive functions for tasks such as making print-friendly bivariate tables, sample size flow counts, and more. It also contains R approximations of some common, useful SAS/Stata functions.
The helper functions bivariate_compare and univar_freq create frequency tables. univar_freq produces simple n and percent for categories of a single variable, while bivariate_compare compares continuous or categorical variables across categories of a comparison variable. This is particularly useful for generating a Table 1 or 2 for a publication manuscript.
sample_flow produces tables illustrating how the final sample size is determined and the number of participants excluded by each exclusion criterion.
nagelkerke calculates the Nagelkerke pseudo r-squared for a logistic regression model.
ladder, gladder, and pwcorr are approximate replicas of the respective Stata functions. Not all functionality is currently incorporated. stata_tidy reformats R model output to a format similar to Stata.
proc_means is an approximate replica of the respective SAS function. Not all functionality is currently incorporated.
Creates ladder-of-powers histograms to visualize nine common transformations and compare each to a normal distribution. The following transformations are included: identity, cubic, square, square root, natural logarithm, inverse square root, inverse, inverse square, and inverse cubic.
gladder(x)
x | A continuous numeric vector. |
A ggplot object with plots of each transformation
gladder(iris$Sepal.Length)
gladder(mtcars$disp)
Searches the ladder of powers to find a transformation that makes x normally distributed. The Shapiro-Wilk test is used to assess normality. The following transformations are included: identity, cubic, square, square root, natural logarithm, inverse square root, inverse, inverse square, and inverse cubic.
ladder(x)
x | A continuous numeric vector. |
A data.frame
ladder(iris$Sepal.Length)
ladder(mtcars$disp)
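The search over the nine transformations can be sketched in base R by applying each one and running shapiro.test() on the result. This is an illustration under stated assumptions rather than the package's implementation: the transformation labels and the output columns (transformation, statistic, p.value) are hypothetical, and the input is assumed to be strictly positive so that the logarithm and inverse transformations are defined.

```r
x <- iris$Sepal.Length  # strictly positive, so every transformation is defined

# The nine ladder-of-powers transformations listed above
transforms <- list(
  identity   = function(v) v,
  cubic      = function(v) v^3,
  square     = function(v) v^2,
  sqrt       = function(v) sqrt(v),
  log        = function(v) log(v),
  inv_sqrt   = function(v) 1 / sqrt(v),
  inverse    = function(v) 1 / v,
  inv_square = function(v) 1 / v^2,
  inv_cubic  = function(v) 1 / v^3
)

# Run the Shapiro-Wilk test on each transformed vector
results <- do.call(rbind, lapply(names(transforms), function(nm) {
  test <- shapiro.test(transforms[[nm]](x))
  data.frame(transformation = nm,
             statistic = unname(test$statistic),
             p.value = test$p.value)
}))

# A W statistic closer to 1 indicates a distribution closer to normal
results[order(-results$statistic), ]
```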
Calculate Nagelkerke pseudo r-squared from a fitted model object.
nagelkerke(mod)
mod |
A |
Numeric value of Nagelkerke r-squared for the model
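For reference, the Nagelkerke statistic rescales the Cox-Snell pseudo r-squared by its maximum attainable value so that it can reach 1. A minimal base R sketch of that standard formula follows; the model (am ~ wt on mtcars) is an illustrative assumption, and whether nagelkerke() computes the value in exactly this way is not confirmed by this manual.

```r
# Logistic regression and its intercept-only counterpart
fit <- glm(am ~ wt, data = mtcars, family = binomial)
null_fit <- update(fit, . ~ 1)

n <- nobs(fit)
ll_fit <- as.numeric(logLik(fit))
ll_null <- as.numeric(logLik(null_fit))

# Cox-Snell r-squared and its upper bound
cox_snell <- 1 - exp(2 * (ll_null - ll_fit) / n)
max_cox_snell <- 1 - exp(2 * ll_null / n)

# Nagelkerke r-squared: Cox-Snell rescaled to the 0-1 range
nagelkerke_r2 <- cox_snell / max_cox_snell
nagelkerke_r2
```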
Plots a simple density histogram for a continuous variable with a normal distribution overlaid. The overlaid normal distribution has the same mean and standard deviation as the provided variable, and the plot provides a visual means to assess the normality of the variable's distribution.
norm_dist_plot(df, vars)
df | A data.frame or tibble. |
vars | A character vector of continuous variable names. |
A ggplot object.
norm_dist_plot(df = iris, vars = "Sepal.Width")
norm_dist_plot(df = iris, vars = c("Sepal.Width", "Sepal.Length"))
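The same idea can be sketched with base graphics for a single variable: a density-scaled histogram with a normal curve that uses the sample mean and standard deviation. This is only an approximation of the plot described above, which is returned as a ggplot object; the names vals, m, and s are illustrative.

```r
vals <- iris$Sepal.Width
m <- mean(vals)
s <- sd(vals)

# Density-scaled histogram so the bars and the curve share a y-axis
hist(vals, freq = FALSE, main = "Sepal.Width", xlab = "Sepal.Width")

# Normal curve with the sample's own mean and standard deviation
curve(dnorm(x, mean = m, sd = s), add = TRUE, lwd = 2)
```

Bars that sit well above or below the curve, or a visibly longer tail on one side, suggest a departure from normality.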
Descriptive statistics for continuous variables, with the option of stratifying by a categorical variable.
proc_means(df, vars = NULL, var_order = NULL, by = NULL, n = T, mean = TRUE, sd = TRUE, min = TRUE, max = TRUE, median = FALSE, q1 = FALSE, q3 = FALSE, iqr = FALSE, nmiss = FALSE, nobs = FALSE, p = FALSE, p_round = 4, display_round = 3)
df | A data frame or tibble. |
vars | Character vector of numeric variables to generate descriptive statistics for. If the default ( |
var_order | Character vector listing the variable names in the order results should be displayed. If the default ( |
by | Discrete variable. Separate statistics will be produced for each level. Default NULL. |
n | Logical. Display number of rows with values. Default TRUE. |
mean | Logical. Display mean value. Default TRUE. |
sd | Logical. Display standard deviation. Default TRUE. |
min | Logical. Display minimum value. Default TRUE. |
max | Logical. Display maximum value. Default TRUE. |
median | Logical. Display median value. Default FALSE. |
q1 | Logical. Display first quartile value. Default FALSE. |
q3 | Logical. Display third quartile value. Default FALSE. |
iqr | Logical. Display interquartile range. Default FALSE. |
nmiss | Logical. Display number of missing values. Default FALSE. |
nobs | Logical. Display total number of rows. Default FALSE. |
p | Logical. Calculate p-value across |
p_round | Number of decimal places p-values should be rounded to. |
display_round | Number of decimal places displayed values should be rounded to. |
A data.frame with columns variable, by variable, and a column for each summary statistic.
proc_means(iris, vars = c("Sepal.Length", "Sepal.Width"))
proc_means(iris, by = "Species")
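To show what the default statistics (n, mean, sd, min, max) look like when stratified by a grouping variable, here is a rough base R equivalent using aggregate(). It is a sketch, not the package's code: the object names are hypothetical, and the rounding to three places simply mirrors the display_round default in the usage above.

```r
# Five summary statistics for one variable within each level of Species
stats_by_group <- aggregate(
  Sepal.Length ~ Species,
  data = iris,
  FUN = function(v) c(n = length(v), mean = mean(v), sd = sd(v),
                      min = min(v), max = max(v))
)

# aggregate() stores the statistics as a matrix column; flatten it for display
summary_table <- do.call(data.frame, stats_by_group)
summary_table[-1] <- round(summary_table[-1], 3)
summary_table
```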
Calculate and return a matrix of pairwise correlation coefficients. Returns significance levels if method == "pearson".
pwcorr(df, vars = NULL, method = "pearson", var_label_df = NULL)
df | A data.frame or tibble. |
vars | A character vector of numeric variables to generate pairwise correlations for. If the default ( |
method | One of |
var_label_df | A data.frame or tibble with columns "variable" and "label" that contains display labels for each variable specified in |
A data.frame displaying the pairwise correlation coefficients between all variables in vars.
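Pairwise coefficients and their p values can be computed in base R with cor.test() over every pair of variables. The sketch below is an assumption about the general approach, not the package's implementation; the h_var, v_var, and p.value column names follow the cor.prob() return value documented earlier, while the estimate column and the chosen variables are illustrative.

```r
vars <- c("mpg", "disp", "hp", "wt")
pairs <- t(combn(vars, 2))  # one row per unordered pair of variables

pairwise <- do.call(rbind, lapply(seq_len(nrow(pairs)), function(i) {
  test <- cor.test(mtcars[[pairs[i, 1]]], mtcars[[pairs[i, 2]]],
                   method = "pearson")
  data.frame(h_var = pairs[i, 1],
             v_var = pairs[i, 2],
             estimate = unname(test$estimate),
             p.value = test$p.value)
}))
pairwise
```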
Generate a table illustrating sequential exclusion from an analytical sample due to user specified exclusions.
sample_flow(df, exclusions = c())
df | A data.frame or tibble. |
exclusions | Character vector of logical conditions indicating which rows should be excluded from the final sample. Exclusions occur in the order specified. |
A data.frame with columns Exclusion, 'Sequential Excluded', and 'Total Excluded' for display.
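Sequential exclusion means each condition is evaluated only on the rows that survived the earlier conditions. A minimal base R sketch of that logic is shown here; the conditions, the object names, and the output columns (exclusion, sequential_excluded, cumulative_excluded) are illustrative assumptions and do not necessarily match the definitions sample_flow() uses for its own columns.

```r
exclusions <- c("mpg < 15", "cyl == 8")  # illustrative conditions

remaining <- mtcars
flow <- data.frame()
for (cond in exclusions) {
  # Evaluate the condition string against the rows still in the sample
  drop <- eval(parse(text = cond), envir = remaining)
  drop[is.na(drop)] <- FALSE  # rows with a missing result are kept
  flow <- rbind(flow, data.frame(exclusion = cond,
                                 sequential_excluded = sum(drop)))
  remaining <- remaining[!drop, , drop = FALSE]
}

# Running total of rows removed after each step
flow$cumulative_excluded <- cumsum(flow$sequential_excluded)
flow
nrow(remaining)  # final analytical sample size
```

Because the order matters, swapping the two conditions changes the per-step counts even though the final sample is the same.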
Create a display data frame similar to Stata model output for a fitted R model.
stata_tidy(mod, var_label_df = NULL)
mod | A fitted model object |
var_label_df | A data.frame or tibble with columns "variable" and "label" that contains display labels for each variable in |
A data.frame with columns term and display
Descriptive statistics (N,
univar_freq(df, var, na.rm = FALSE)
df | A data frame or tibble. |
var | A discrete, numeric variable. |
na.rm | Logical. Should missing values (including |
A data.frame with columns var, NObs, and Percent
univar_freq(iris, var = "Species")
univar_freq(mtcars, var = "cyl")
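A count-and-percent table of this kind can be approximated in base R with table() and prop.table(). The sketch below is illustrative only; the column names level, n, and percent are hypothetical and differ from the var, NObs, and Percent columns returned by univar_freq().

```r
# Counts per level, keeping a separate row for missing values if any exist
counts <- table(mtcars$cyl, useNA = "ifany")

freq <- data.frame(
  level = names(counts),
  n = as.integer(counts),
  percent = round(100 * as.numeric(prop.table(counts)), 2)
)
freq
```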