Statistical techniques
Descriptive and inferential
This page summarizes the statistical methods that typically go into the statistical section of a write-up. It is therefore a list of statistical techniques with brief notes on their applications. Applying these techniques requires a range of data manipulation steps that depend on the method used, and we hope to cover these in the near future. Many other statistical techniques exist, but those presented here are the ones most commonly used during consultations.
Because this page is continuously improved for usability and in response to demand, the prescription numbering may have changed in some sections. If you are returning to refresh your memory, please take note of the last revision date.
Statistical Data Analysis
Preparation
Dataset loading, merging, subsetting, tidying, data science skills, etc.
Statistical Data Analysis
Methods
Depend on whether the data are quantitative or qualitative
Statistical Data Analysis
Software
Depends on data type, algorithm efficiency, output elegance, etc
0: Statistical data preparation - pre-flight checks
0.1.1 - 0.2.2
0.1.1 Keep the datasets separate until you meet the statistician.
0.1.2 In Microsoft Excel, the datasets may be kept as separate tabs in a single workbook.
0.2.1 Column names must be short, preferably 3 words at most, while clearly explaining the meaning of the values under the column.
0.2.2 Column names can be standard abbreviations, if applicable.
0.2.3 - 0.2.5
0.2.3 Words must be separated by underscores, e.g. ‘cd4_count_group’.
0.2.4 Special characters such as /, $, (, ), -, %, *, #, comma, etc. must be avoided.
0.2.5 Names cannot start with a number, as in ‘3rd_stage’. The preferred format is ‘stage_3’ or ‘stage3’.
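The naming rules in 0.2.1 to 0.2.5 can be sketched as a small helper. This is only an illustration using the Python standard library: the function name `clean_column_name`, the choice to lowercase everything, and the `x_` fallback prefix are assumptions, not part of the rules above.

```python
import re


def clean_column_name(raw: str) -> str:
    """Turn a raw column header into words joined by underscores."""
    # Replace every run of characters other than letters and digits
    # (spaces, /, $, %, brackets, hyphens, ...) with one underscore.
    name = re.sub(r"[^0-9A-Za-z]+", "_", raw.strip()).strip("_").lower()

    # A name must not start with a number: move a leading number
    # (with an optional ordinal suffix) to the end of the name.
    match = re.match(r"^(\d+)(?:st|nd|rd|th)?_(.+)$", name)
    if match:
        name = f"{match.group(2)}_{match.group(1)}"
    elif name[:1].isdigit():
        # Hypothetical fallback when the name is nothing but a number.
        name = "x_" + name
    return name


print(clean_column_name("CD4 count (group)"))  # cd4_count_group
print(clean_column_name("3rd stage"))          # stage_3
```

The helper does not shorten long names or pick abbreviations (0.2.1 and 0.2.2); those decisions still need a person who knows the data.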
Beyond points 0.1 and 0.2 above, there are usually many other data preparation tasks that require specialized data science skills.
0.3.1 - 0.3.4
0.3.1 Almost every dataset has some kind of issue. Be aware that the data cleaning process can take longer than expected.
0.3.2 The tasks are better handled by statistical programming.
0.3.3 This is now the data science world.
0.3.4 Contact us at Analytics Consultina from any part of the world and we will Zoom into your country, office, campus, etc.
1A1 - 1B2.4.3
1A1 Minimum and maximum
1A2 Quartiles and interquartile range: median (Q1–Q3)
1A3 Mean, standard deviation and coefficient of variation: mean ± SD (CV)
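The summaries in 1A1 to 1A3 can be computed with the Python standard library, as in the minimal sketch below. The function name `describe` and the sample values are made up for illustration. Quartile definitions differ between packages, and the default method of `statistics.quantiles` used here may not match the software listed in section 5.

```python
import statistics


def describe(values):
    """Return the numeric summaries listed in 1A1 to 1A3."""
    # Quartiles use the default "exclusive" method of statistics.quantiles.
    q1, median, q3 = statistics.quantiles(values, n=4)
    mean = statistics.mean(values)
    sd = statistics.stdev(values)  # sample standard deviation (n - 1)
    return {
        "min": min(values),
        "max": max(values),
        "q1": q1,
        "median": median,
        "q3": q3,
        "mean": mean,
        "sd": sd,
        "cv_percent": 100 * sd / mean,  # coefficient of variation in %
    }


summary = describe([2, 4, 4, 4, 5, 5, 7, 9])  # illustrative values only
print(f"{summary['median']:.1f} ({summary['q1']:.1f}-{summary['q3']:.1f})")
print(f"{summary['mean']:.1f}±{summary['sd']:.1f} ({summary['cv_percent']:.1f}%)")
```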
1B1 In order to assess the mean or median difference of a numeric variable between two groups:
1B1.1 The t-test is used for testing the mean difference of a normally distributed variable between two independent groups.
1B1.2 The Wilcoxon rank-sum test is used for testing the median difference of a non-normally distributed variable between two independent groups.
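As an illustration of 1B1.2, the sketch below computes a two-sided Wilcoxon rank-sum test with the Python standard library. It rests on simplifying assumptions: the p-value comes from the large-sample normal approximation, with no continuity correction and no tie correction to the variance, so statistical packages may report slightly different values. The function name `rank_sum_test` is hypothetical.

```python
import math
from statistics import NormalDist


def rank_sum_test(x, y):
    """Two-sided Wilcoxon rank-sum test via the normal approximation.

    Returns (U statistic for x, z score, p-value).
    """
    pooled = sorted([(v, 0) for v in x] + [(v, 1) for v in y])

    # Assign ranks 1..N, giving tied values the average of their ranks.
    ranks = [0.0] * len(pooled)
    i = 0
    while i < len(pooled):
        j = i
        while j + 1 < len(pooled) and pooled[j + 1][0] == pooled[i][0]:
            j += 1
        midrank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[k] = midrank
        i = j + 1

    n1, n2 = len(x), len(y)
    rank_sum_x = sum(r for r, (_, group) in zip(ranks, pooled) if group == 0)
    u = rank_sum_x - n1 * (n1 + 1) / 2

    mean_u = n1 * n2 / 2
    sd_u = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)  # no tie correction
    z = (u - mean_u) / sd_u
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return u, z, p_value


u, z, p = rank_sum_test([1, 2, 3, 4, 5], [6, 7, 8, 9, 10])
print(u, round(z, 3), round(p, 4))
```

For small samples an exact test is generally preferable to this approximation.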
1B2 Assessing the mean or median difference of a numerical variable across at least three groups:
1B2.4 Post-hoc tests are conducted using the:
1B2.4.1 Following a significant ANOVA (parametric): Tukey’s honestly significant difference (HSD) test, a single-step multiple-comparison procedure for pairwise differences in means.
1B2.4.2 Following a significant Kruskal-Wallis test (non-parametric): Dunn’s test, a multiple-comparison procedure for pairwise differences in medians.
1B2.4.3 Following MANOVA: (a) Games-Howell test if homogeneity of variance is violated, (b) Tukey’s HSD if homogeneity of variance is met, (c) etc.
1B2.1 - 1C4
1B2.1 The ANOVA parametric test is used for testing the mean difference of a single normally distributed variable. (a) One-way, i.e. with one categorical variable (b) r-way, i.e. with r categorical variables.
1B2.2 The Kruskal-Wallis non-parametric test is used for testing the median difference of a single non-normally distributed variable. (a) One-way, i.e. with one categorical variable (b) r-way, i.e. with r categorical variables.
1B2.3 MANOVA is used when there are at least two numeric variables, usually under parametric assumptions. (a) One-way, i.e. with one categorical variable (b) r-way, i.e. with r categorical variables.
1C Visual displays of the numeric descriptive statistics (and in some cases annotated statistical tests) are presented in the form of:
1C1 Histograms – unidimensional
1C2 Scatterplots – two numeric variables (a) GAM, (b) LOESS and (c) Linear correlation + p-values + Regression line
1C3 (a) Box plots or (b) Violin plots – one numeric variable with one grouping categorical variable and one faceting categorical variable (c) Paired box plots – dependent samples (d) Multidimensional box plots – many numeric variables with one grouping categorical variable
1C4 Line graph – (a) ANOVA one way – one numeric variable with one categorical variable or (b) ANOVA two way – one numeric variable with two categorical variables.
1C5 - 1C11
1C5 Tukey pairwise – one numeric variable and one categorical variable with at least three categories.
1C6 Bar chart – one numeric variable and one categorical variable with the possibility of faceting. Suitable when one of the categorical variables has an unusually high number of levels. Summary measure (a) mean, (b) median, (c) mode and (d) CV
1C7 Dot plot – one numeric variable and one categorical variable + one more categorical variable for faceting. Suitable when one of the categorical variables has an unusually high number of levels.
1C8 Slope graph – one numeric variable and two categorical variables. The two categorical variables can be initially supplied as numeric and internally transformed into categorical (5 classes by default, with an option to change). Suitable when one of the categorical variables has an unusually high number of levels. Summary measure (a) mean, (b) median, (c) mode and (d) CV
1C9 Parallel plot – multidimensional numeric variables with one categorical variable. Summary measure (a) mean, (b) median, (c) mode and (d) CV
1C10 Density plot – multidimensional numeric variables (a) with or (b) without a grouping categorical variable.
1C11 Correlation plots – multidimensional numeric variables.
2A - 2B4
2A Categorical variables are described as counts and percentage frequencies.
2B The omnibus test for association between two categorical variables is determined by:
2B1 Chi-square test for large expected frequencies (expected cell counts of at least 5)
2B2 Fisher’s exact test for small expected frequencies (expected cell counts below 5)
2B3 When the omnibus test is significant for an r × 2 table, post-hoc pairwise comparisons of row-wise proportions are also conducted using:
2B3.1 Chi-square test for large expected frequencies (expected cell counts of at least 5)
2B3.2 Fisher’s exact test for small expected frequencies (expected cell counts below 5)
2B4 McNemar’s test for paired nominal data in a 2×2 table, that is, a test for differences in a dichotomous dependent variable between two related (dependent) groups.
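The choice between the chi-square test and Fisher’s exact test in 2B1 and 2B2 can be sketched for a 2×2 table with the Python standard library. Several assumptions apply: the switching rule is the common rule of thumb that every expected cell count should be at least 5, the chi-square statistic has no continuity correction, the two-sided Fisher p-value sums the probabilities of all tables no more likely than the observed one, and every row and column total is assumed to be non-zero. The function names are hypothetical.

```python
import math


def expected_counts(table):
    """Expected cell counts under independence of rows and columns."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    return [[r * c / n for c in col_totals] for r in row_totals]


def chi_square_2x2(table):
    """Pearson chi-square statistic and p-value for a 2x2 table (1 df)."""
    expected = expected_counts(table)
    stat = sum(
        (o - e) ** 2 / e
        for obs_row, exp_row in zip(table, expected)
        for o, e in zip(obs_row, exp_row)
    )
    # With 1 degree of freedom, P(X > stat) = erfc(sqrt(stat / 2)).
    return stat, math.erfc(math.sqrt(stat / 2))


def fisher_exact_2x2(table):
    """Two-sided Fisher exact p-value for a 2x2 table."""
    (a, b), (c, d) = table
    r1, r2, c1, n = a + b, c + d, a + c, a + b + c + d

    def prob(k):
        # Hypergeometric probability of k in the top-left cell.
        return math.comb(r1, k) * math.comb(r2, c1 - k) / math.comb(n, c1)

    p_obs = prob(a)
    lowest, highest = max(0, c1 - r2), min(r1, c1)
    return sum(
        prob(k)
        for k in range(lowest, highest + 1)
        if prob(k) <= p_obs * (1 + 1e-9)  # tolerance for float rounding
    )


def association_test(table):
    """Pick the test from the smallest expected count; return (name, p)."""
    smallest = min(min(row) for row in expected_counts(table))
    if smallest >= 5:
        return "chi-square", chi_square_2x2(table)[1]
    return "fisher", fisher_exact_2x2(table)


print(association_test([[30, 10], [15, 25]]))
print(association_test([[3, 1], [1, 3]]))
```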
2C1 - 2C2.3
Visual displays of the categorical descriptive statistics (and in some cases annotated statistical tests) are presented in the form of:
2C1 Pie chart – unidimensional or bivariate + faceting
2C2 Simple bar charts – unidimensional and multidimensional categorical variables
2C2.1 Simple bar chart – one categorical variable
2C2.2 Multiple bar chart – two categorical variables
2C2.3 Pareto chart – unidimensional (a) Standard (b) Standard + faceting (c) Unnested – row separation of multiple responses (d) Unnested – row separation of multiple responses + faceting.
2C3 - 2C7
2C3 Component bar chart – two categorical variables + faceting
2C4 Alluvial diagrams – multidimensional categorical variables. Equivalent to a cross tabulation of several variables. Shows flows with category percentages.
2C5 Icon plots – multidimensional categorical variables. Equivalent to a cross tabulation of several variables. Shows relative proportions visually as boxes, without percentages.
2C6 (a) Likert bar plot and (b) Heatmap – multidimensional categorical variables. Suitable for Likert scale type questions.
2C7 Strength of association plots – multidimensional categorical variables. Correlation plots based on values from Cramer’s V, etc.
3: Inferential Statistics - Regression based
3A1 - 3A1.2
Regression-based methods, with model accuracy being key for prediction. Pre- and post-flight variable selection plus stepwise regression.
3A1 Categorical dependent variable
3A1.1 Binary logistic regression – dependent variable with 2 categories (a) without stepwise regression (b) with stepwise regression
3A1.1.1 Forest plot (a) Stepwise model (b) Full model
3A1.2 Multinomial logistic regression – dependent variable with at least 3 categories (a) without stepwise regression (b) with stepwise regression
3A1.2.1 - 3B2.2
3A1.2.1 Effects plot
3A1.3 Survival analysis – dependent variable with 2 categories and time to event (a) without stepwise regression (b) with stepwise regression
3A1.3.1 Kaplan Meier curves
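The Kaplan-Meier curves in 3A1.3.1 are built from the product-limit estimator, which can be sketched in plain Python as below. This is a minimal sketch that assumes events are coded 1 for an observed event and 0 for a censored observation. It returns the survival probability at each distinct event time only, without confidence intervals. The function name `kaplan_meier` and the sample data are made up.

```python
def kaplan_meier(times, events):
    """Product-limit survival estimate.

    times  -- follow-up time for each subject
    events -- 1 if the event was observed, 0 if censored
    Returns a list of (time, survival probability) at each event time.
    """
    data = sorted(zip(times, events))
    at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        current_time = data[i][0]
        observed = 0   # events at this time
        leaving = 0    # subjects leaving the risk set at this time
        while i < len(data) and data[i][0] == current_time:
            observed += data[i][1]
            leaving += 1
            i += 1
        if observed:
            # Multiply by the share of at-risk subjects surviving this time.
            survival *= 1 - observed / at_risk
            curve.append((current_time, survival))
        at_risk -= leaving
    return curve


# Four subjects; the second is censored at time 2.
print(kaplan_meier([1, 2, 3, 4], [1, 0, 1, 1]))
# [(1, 0.75), (3, 0.375), (4, 0.0)]
```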
3B2 Numerical dependent variable
3B2.1 Poisson – counts with typical Poisson distribution and relatively few zeros
3B2.1.1 Interaction plot
3B2.2 Quasi-Poisson – counts with overdispersion
3B2.3 - 3C3.1
3B2.3 Hurdle, Negative binomial – counts having too many zeros
3B2.4 Gamma – strictly positive, right-skewed values (> 0)
3B2.5 Gaussian – normally distributed
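Choosing among the count models in 3B2.1 to 3B2.3 usually starts with two quick diagnostics: the variance-to-mean ratio (about 1 under a Poisson distribution, larger under overdispersion) and the observed share of zeros compared with the share a Poisson distribution with the same mean would produce. The sketch below computes both with the Python standard library. It is only a screening aid on the raw counts, not a model fit, and the function name and sample data are hypothetical.

```python
import math
import statistics


def count_diagnostics(counts):
    """Screening diagnostics for choosing a count regression family."""
    mean = statistics.mean(counts)
    variance = statistics.variance(counts)  # sample variance (n - 1)
    zero_share = sum(1 for c in counts if c == 0) / len(counts)
    return {
        "mean": mean,
        "variance": variance,
        # Near 1 suggests Poisson; well above 1 suggests overdispersion.
        "dispersion_ratio": variance / mean,
        "observed_zero_share": zero_share,
        # Share of zeros a Poisson distribution with this mean would give.
        "poisson_zero_share": math.exp(-mean),
    }


print(count_diagnostics([0, 0, 0, 0, 10]))  # illustrative values only
```

A dispersion ratio well above 1 points towards quasi-Poisson or negative binomial models, and an observed zero share far above the Poisson share points towards a hurdle model.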
3C3 Multivariate – at least two dependent variables analyzed simultaneously with other variables
3C3.1 Structural Equation Models (SEM)
(a) Exploratory Factor Analysis (EFA) (b) Confirmatory Factor Analysis (CFA)
And many more as may be required by the project objectives
4A1 - 4A2
4A1 Reliability analysis – Cronbach’s alpha
4A2 Inter-rater analysis – (a) ICC, (b) Kappa, (c) etc.
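Cronbach’s alpha (4A1) for k items is k/(k-1) multiplied by one minus the ratio of the summed item variances to the variance of the total score. A minimal sketch with the Python standard library follows. It assumes complete data with one list of scores per item, aligned by respondent. The function name and the scores are made up.

```python
import statistics


def cronbach_alpha(items):
    """Cronbach's alpha for a scale.

    items -- one list of scores per item; position i in every list
             belongs to the same respondent.
    """
    k = len(items)
    # Total score of each respondent across all items.
    totals = [sum(scores) for scores in zip(*items)]
    item_variance_sum = sum(statistics.variance(item) for item in items)
    total_variance = statistics.variance(totals)
    return k / (k - 1) * (1 - item_variance_sum / total_variance)


# Three items answered by five respondents (illustrative scores only).
scale = [
    [3, 4, 5, 2, 4],
    [2, 4, 5, 1, 3],
    [3, 5, 4, 2, 4],
]
print(round(cronbach_alpha(scale), 3))
```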
4A3 - 4A4.1
4A3 Case-control analysis
4A4 Signal detection theory
4A4.1 Sensitivity and Specificity analysis
4A4.2
4A4.2 Optimal cut-off points based on (a) Youden’s index, (b) etc.
And many more as may be required by the project objectives
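Sensitivity, specificity and a Youden-based cut-off (4A4.1 and 4A4.2) can be sketched as follows. The sketch assumes that higher scores indicate the positive class, that labels are coded 1 for positive and 0 for negative, that both classes are present, and that a score at or above the cut-off counts as a positive result. Youden’s index is J = sensitivity + specificity - 1. The function names and data are hypothetical.

```python
def sensitivity_specificity(scores, labels, cutoff):
    """Sensitivity and specificity when score >= cutoff means positive."""
    pairs = list(zip(scores, labels))
    tp = sum(1 for s, y in pairs if y == 1 and s >= cutoff)
    fn = sum(1 for s, y in pairs if y == 1 and s < cutoff)
    tn = sum(1 for s, y in pairs if y == 0 and s < cutoff)
    fp = sum(1 for s, y in pairs if y == 0 and s >= cutoff)
    return tp / (tp + fn), tn / (tn + fp)


def youden_cutoff(scores, labels):
    """Return (cut-off, J) maximizing Youden's index over observed scores."""
    def youden(cutoff):
        sens, spec = sensitivity_specificity(scores, labels, cutoff)
        return sens + spec - 1

    # Every distinct observed score is tried as a candidate cut-off;
    # on ties the lowest candidate is kept.
    best = max(sorted(set(scores)), key=youden)
    return best, youden(best)


scores = [0.1, 0.2, 0.3, 0.8, 0.9]   # illustrative test values
labels = [0, 0, 0, 1, 1]             # 1 = condition present
print(youden_cutoff(scores, labels))  # (0.8, 1.0)
```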
5: Statistical data analysis software
5A1
At Analytics Consultina we strongly believe in and use the most powerful statistical software in the world.
5A1 R Statistical Computing Software of the R Core Team, 2020, version 3.6.3.
All our macros are currently compatible with version 3.6.3
5A2-5A3
5A2 IBM SPSS Statistics (Statistical Package for the Social Sciences). Currently using version 27.
5A3 IBM SPSS AMOS (Analysis of Moment Structures). Currently using version 27. AMOS is developed specifically for Structural Equation Models (SEM).
5A4-5B1
5A4 Stata (Statistics/Data Analysis)
5A5 Statistical Analysis System (SAS)
5B1 NVivo
And many more as may be required by the project objectives