Statistical techniques

Descriptive and inferential

This page summarises the statistical methods for the write-up of the statistical section. It is therefore a list of statistical techniques with brief explanations of their applications. Please note that applying these techniques requires a range of data manipulation techniques, depending on the method to be used; we hope to include these in the near future. Many other statistical techniques can be applied, but presented here are those commonly used during consultations.

Due to continuous improvement for usability and demand, the prescription numbering may have changed in some sections. If you are coming back to refresh, we kindly advise you to take note of the last revision date.

Statistical Data Analysis
Preparation

Dataset loading, merging, subsetting, tidying, data science skills, etc.

Statistical Data Analysis
Methods

Depends on either quantitative or qualitative data

Statistical Data Analysis
Software

Depends on data type, algorithm efficiency, output elegance, etc

0: Statistical data preparation - pre flight checks

0.1.1 - 0.2.2

0.1.1 Keep the datasets separate until you meet the statistician.

0.1.2 In the case of MS Excel spreadsheets, the datasets may be kept as tabs in a single workbook.

0.2.1 Column names must be short, preferably 3 words at most, while clearly explaining the meaning of the values under the column.

0.2.2 Column names can be standard abbreviations, if applicable.

0.2.3 - 0.2.5

0.2.3 Words must be separated by underscores, e.g. ‘cd4_count_group’

0.2.4 Must avoid special characters such as /, $, (, ), -, %, *, #, comma, etc.

0.2.5 Cannot start with a number, so ‘3rd_stage’ is not acceptable. The preferred format is ‘stage_3’ or ‘stage3’.
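The naming rules in 0.2.3 to 0.2.5 can be applied programmatically. The following is a minimal Python sketch, not the method used in consultations: the function name `clean_column_name` and the `col_` prefix for names that begin with a digit are assumptions made for illustration only.

```python
import re

def clean_column_name(name: str) -> str:
    """Hypothetical helper: make one column name follow rules 0.2.3-0.2.5."""
    # Lower-case the name, then replace every run of characters that is not
    # a letter or digit (spaces, /, $, %, brackets, ...) with one underscore.
    cleaned = re.sub(r"[^0-9a-z]+", "_", name.strip().lower()).strip("_")
    # Rule 0.2.5: a name cannot start with a number. Prefixing "col_" is an
    # assumption; renaming by hand (e.g. "stage_3") usually reads better.
    if cleaned == "" or cleaned[0].isdigit():
        cleaned = ("col_" + cleaned).rstrip("_")
    return cleaned

print(clean_column_name("CD4 count (group)"))  # cd4_count_group
print(clean_column_name("3rd stage"))          # col_3rd_stage
```

A name that already follows the rules, such as ‘cd4_count_group’, passes through unchanged.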

Beyond points 0.1 and 0.2 above, there are usually many other data preparation tasks that require specialized data science skills.

 

0.3.1 - 0.3.4

0.3.1 Almost every dataset has some kind of issue. Be aware that the data cleaning process can take longer than expected.

0.3.2 The tasks are better handled by statistical programming.

0.3.3 This is now the data science world.

0.3.4 Contact us at Analytics Consultina from any part of the world and we will Zoom into your country, office, campus, etc.

Book an appointment with Personnel Calendar using SetMore

1A1 - 1B2.4.3

1A1 Minimum and maximum

1A2 Quartiles and interquartile range: median (Q1–Q3)

1A3 Mean, standard deviation and the coefficient of variation: mean ± SD (CV)
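The numeric summaries in 1A1 to 1A3 can be sketched with the Python standard library. The function name `describe` and the sample values are illustrative assumptions. Quartile definitions also differ between software packages, so Q1 and Q3 from this sketch may not match R or SPSS output exactly.

```python
import statistics

def describe(values):
    """Hypothetical helper: the summaries listed under 1A1-1A3."""
    # statistics.quantiles with n=4 returns the three quartile cut points.
    q1, median, q3 = statistics.quantiles(values, n=4)
    mean = statistics.mean(values)
    sd = statistics.stdev(values)  # sample standard deviation (n - 1)
    return {
        "min": min(values),
        "max": max(values),
        "q1": q1,
        "median": median,
        "q3": q3,
        "mean": mean,
        "sd": sd,
        "cv_percent": 100 * sd / mean,  # coefficient of variation in %
    }

summary = describe([2, 4, 4, 4, 5, 5, 7, 9])  # made-up sample values
for key, value in summary.items():
    print(key, round(value, 2))
```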

1B1 In order to assess the mean or median difference of a numeric variable between two groups:

1B1.1 The t-test is used for testing the mean difference of a normally distributed variable between two independent groups.

1B1.2 The Wilcoxon rank-sum test is used for testing the median difference of a non-normally distributed variable between two independent groups.

1B2 Assessing the mean or median difference of a numerical variable across at least three groups:

1B2.4 Post-hoc tests are conducted using the:

1B2.4.1 Following the ANOVA parametric test – Tukey’s honestly significant difference (HSD) test, a parametric single-step procedure for multiple mean comparisons, is used should there be any significant difference in the means.

1B2.4.2 Following the Kruskal-Wallis non-parametric test – Dunn’s test, a non-parametric procedure for multiple median comparisons, is used should there be any significant difference in the medians.

1B2.4.3 Following the MANOVA test – (a) the Games-Howell test if variance homogeneity is violated, (b) Tukey’s HSD if variance homogeneity is met, (c) etc.

1B2.1 - 1C4

1B2.1 The ANOVA parametric test is used for testing the mean difference of a normally distributed single variable. (a) One-way, i.e. with one categorical variable (b) r-way, i.e. with r categorical variables.

1B2.2 The Kruskal-Wallis non-parametric test is used for testing the median difference of a non-normally distributed single variable. (a) One-way, i.e. with one categorical variable (b) r-way, i.e. with r categorical variables.

1B2.3 MANOVA – usually parametric assumptions and at least two numeric variables. (a) One-way, i.e. with one categorical variable (b) r-way, i.e. with r categorical variables.

1C Visual displays of the numeric descriptive statistics (in some cases with annotated statistical tests) are presented in the form of:

1C1 Histograms – unidimensional

1C2 Scatterplots – two numeric variables (a) GAM, (b) LOESS and (c) Linear correlation + p-values + Regression line

1C3 (a) Box plots or (b) Violin plots – one numeric variable with one grouping categorical variable and one faceting categorical variable (c) Paired box plots – dependent samples (d) Multidimensional box plots – many numeric variables with one grouping categorical variable

1C4 Line graph – (a) ANOVA one way – one numeric variable with one categorical variable or (b) ANOVA two way – one numeric variable with two categorical variables.

1C5 - 1C11

1C5 Tukey pairwise – one numeric variable and one categorical variable with at least three categories.

1C6 Bar chart – one numeric variable and one categorical variable with possibility of faceting. Suitable when one of the categorical variables has an unusually high number of levels. Summary measure (a) mean, (b) median, (c) mode and (d) CV

1C7 Dot plot – one numeric variable and one categorical variable + one more categorical variable for faceting. Suitable when one of the categorical variables has an unusually high number of levels.

1C8 Slope graph – one numeric variable and two categorical variables. The two categorical variables can be initially supplied as numeric and internally transformed into categorical (5 classes by default, with the option to change). Suitable when one of the categorical variables has an unusually high number of levels. Summary measure (a) mean, (b) median, (c) mode and (d) CV

1C9 Parallel plot – multidimensional numeric variables with one categorical variable. Summary measure (a) mean, (b) median, (c) mode and (d) CV

1C10 Density plot – multidimensional numeric variables (a) with or (b) without a grouping categorical variable.

1C11 Correlation plots – multidimensional numeric variables.

2A - 2B4

2A Categorical variables are described as counts and percentage frequencies.

2B The omnibus test for association between two categorical variables is determined by:

2B1 Chi square test for large expected frequencies (expected cell counts >5)

2B2 Fisher’s exact test for small expected frequencies (expected cell counts <=5)

2B3 In the case of a significant omnibus test for an r x 2 table, a post-hoc test for equal row-wise proportions is also conducted through pairwise comparisons of the row proportions using:

2B3.1 Chi square test for large expected frequencies (expected cell counts >5)

2B3.2 Fisher’s exact test for small expected frequencies (expected cell counts <=5)

2B4 McNemar test for paired test of nominal data – 2×2 table. That is, test for differences on a dichotomous dependent variable between two related/dependent groups
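The choice between the tests in 2B1 and 2B2, and the McNemar test in 2B4, can be sketched for 2×2 tables with the Python standard library. This sketch assumes the threshold in 2B1 and 2B2 refers to an expected cell count of 5. All function names and the example tables are hypothetical, and neither test applies a continuity correction here.

```python
import math

def expected_counts(table):
    """Expected cell counts under independence: row total * column total / n."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    return [[r * c / total for c in col_totals] for r in row_totals]

def suggested_test(table, threshold=5):
    """Assumed rule of thumb: chi-square only if every expected count > threshold."""
    smallest = min(min(row) for row in expected_counts(table))
    return "chi-square" if smallest > threshold else "fisher"

def chi_square_2x2(table):
    """Pearson chi-square statistic and p-value for a 2x2 table (1 df)."""
    expected = expected_counts(table)
    stat = sum(
        (obs - exp) ** 2 / exp
        for obs_row, exp_row in zip(table, expected)
        for obs, exp in zip(obs_row, exp_row)
    )
    # For 1 degree of freedom the chi-square tail area equals erfc(sqrt(x/2)).
    return stat, math.erfc(math.sqrt(stat / 2))

def mcnemar_2x2(table):
    """McNemar statistic and p-value from the two discordant cells b and c."""
    b, c = table[0][1], table[1][0]
    stat = (b - c) ** 2 / (b + c)
    return stat, math.erfc(math.sqrt(stat / 2))

independent = [[20, 30], [30, 20]]   # made-up counts, two independent groups
paired = [[40, 5], [15, 40]]         # made-up counts, before vs after
print(suggested_test(independent), chi_square_2x2(independent))
print(mcnemar_2x2(paired))
```

Fisher’s exact test itself is not implemented in this sketch; in practice it would come from statistical software.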

2C1 - 2C2.3

Visual display of the categorical descriptive statistics and in some cases annotated statistical tests are presented.

2C1 Pie chart – unidimensional or bivariate + faceting

2C2 Simple bar charts – unidimensional and multidimensional categorical variables

2C2.1 Simple bar chart – one categorical variable

2C2.2 Multiple bar chart – two categorical variables

2C2.3 Pareto chart – unidimensional (a) Standard (b) Standard + faceting (c) Unnested – row separation of multiple responses (d) Unnested – row separation of multiple responses + faceting.

2C3 - 2C7

 

2C3 Component bar chart – two categorical variables + faceting

2C4 Alluvial diagrams – multidimensional categorical variables. Equivalent to a cross tabulation of several variables. Show flows with category percentages.

2C5 Icon plots – multidimensional categorical variables. Equivalent to a cross tabulation of several variables. Simply visually show relative proportions as boxes with no percentages

2C6 (a) Likert bar plot and (b) Heatmap – multidimensional categorical variables. Suitable for Likert scale type questions.

2C7 Strength of association plots – multidimensional categorical variables. Correlation plots based on values from Cramer’s V, etc.

3A1 - 3A1.2

Regression-based methods, with model accuracy being key for prediction. Pre- and post-flight variable selection + stepwise selection.

3A1 Categorical dependent variable

3A1.1 Binary logistic regression – dependent variable with 2 categories (a) without stepwise regression (b) with stepwise regression

3A1.1.1 Forest plot (a) Stepwise model (b) Full model

3A1.2 Multinomial logistic regression – dependent variable with at least 3 categories (a) without stepwise regression (b) with stepwise regression

3A1.2.1 - 3B2.2

 

3A1.2.1 Effects plot

3A1.3 Survival analysis – dependent variable with 2 categories and time to event (a) without stepwise regression (b) with stepwise regression

3A1.3.1 Kaplan Meier curves

3B2 Numerical dependent variable

3B2.1 Poisson – counts with typical Poisson distribution and relatively few zeros

3B2.1.1 Interaction plot

3B2.2 Quasi-poisson – counts with overdispersion

 

3B2.3 - 3C3.1

3B2.3 Hurdle, Negative binomial – counts having too many zeros

3B2.4 Gamma – counts >=1

3B2.5 Gaussian – normally distributed
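The distinctions drawn in 3B2.1 to 3B2.3 (typical Poisson counts, overdispersion, too many zeros) can be screened with two simple quantities before any model is fitted. The Python sketch below is a rough check on the raw counts only: it ignores covariates, the function name `count_diagnostics` and the sample data are hypothetical, and it does not replace model-based diagnostics.

```python
import math
import statistics

def count_diagnostics(counts):
    """Hypothetical helper: rough screening quantities for a count variable."""
    mean = statistics.mean(counts)
    variance = statistics.variance(counts)  # sample variance (n - 1)
    return {
        "mean": mean,
        "variance": variance,
        # A Poisson variable has variance equal to its mean, so a ratio
        # well above 1 points towards overdispersion.
        "dispersion_ratio": variance / mean,
        # Observed share of zeros versus the share a Poisson distribution
        # with the same mean would produce, exp(-mean).
        "zero_share": counts.count(0) / len(counts),
        "poisson_zero_share": math.exp(-mean),
    }

visits = [0, 0, 0, 0, 10]  # made-up counts with many zeros
print(count_diagnostics(visits))
```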

3C3 Multivariate – at least two dependent variables analyzed simultaneously with other variables 

3C3.1 Structural Equation Models (SEM)
(a) Exploratory Factor Analysis (EFA) (b) Confirmatory Factor Analysis (CFA)

And many more as may be required by the project objectives

4A1 - 4A2

4A1 Reliability analysis – Cronbach alpha

4A2 Inter-rater analysis – (a) ICC, (b) Kappa, (c) etc
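For the reliability analysis in 4A1, Cronbach’s alpha compares the sum of the item variances with the variance of the total score: alpha = k/(k-1) × (1 - sum of item variances / variance of totals), where k is the number of items. The Python sketch below illustrates the formula under assumptions: the function name and the scores are made up, and every item must be scored for the same respondents in the same order.

```python
import statistics

def cronbach_alpha(items):
    """Hypothetical helper. items: one list of scores per questionnaire item."""
    k = len(items)
    # Variance of each item across respondents.
    item_variances = [statistics.variance(scores) for scores in items]
    # Total score per respondent, then the variance of those totals.
    totals = [sum(responses) for responses in zip(*items)]
    total_variance = statistics.variance(totals)
    return k / (k - 1) * (1 - sum(item_variances) / total_variance)

scores = [
    [1, 2, 3, 4],  # item 1, four respondents (made-up data)
    [2, 1, 4, 3],  # item 2, same respondents
]
print(cronbach_alpha(scores))
```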

 

4A3 - 4A4.1

4A3 Case-control analysis

4A4 Signal detection theory

4A4.1 Sensitivity and Specificity analysis

4A4.2

4A4.2 Optimal cut-off points based on (a) Youden’s index, (b) etc.
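The sensitivity and specificity analysis in 4A4.1 and the Youden’s index cut-off in 4A4.2 can be illustrated with a short Python sketch. It assumes that a score at or above the cut-off is classified as positive and that labels are coded 1 for cases and 0 for non-cases. The function names and data are hypothetical.

```python
def sensitivity_specificity(scores, labels, cutoff):
    """Classify score >= cutoff as positive; labels are 1 (case) or 0 (non-case)."""
    pairs = list(zip(scores, labels))
    tp = sum(1 for s, y in pairs if y == 1 and s >= cutoff)
    fn = sum(1 for s, y in pairs if y == 1 and s < cutoff)
    tn = sum(1 for s, y in pairs if y == 0 and s < cutoff)
    fp = sum(1 for s, y in pairs if y == 0 and s >= cutoff)
    return tp / (tp + fn), tn / (tn + fp)

def youden_cutoff(scores, labels):
    """Return (cutoff, J) maximising J = sensitivity + specificity - 1."""
    best_cutoff, best_j = None, None
    for cutoff in sorted(set(scores)):  # every observed score is a candidate
        sens, spec = sensitivity_specificity(scores, labels, cutoff)
        j = sens + spec - 1
        if best_j is None or j > best_j:
            best_cutoff, best_j = cutoff, j
    return best_cutoff, best_j

marker = [1, 2, 3, 4, 5, 6]   # made-up test scores
disease = [0, 0, 0, 1, 1, 1]  # made-up true status
print(youden_cutoff(marker, disease))
```

When several cut-offs tie on Youden’s index, this sketch keeps the lowest one; another tie-breaking rule may suit a given project better.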

And many more as may be required by the project objectives

5: Statistical data analysis software

5A1

At Analytics Consultina we strongly believe in and use the most powerful statistical software in the world.
 

5A1 R Statistical Computing Software of the R Core Team, 2020, version 3.6.3.

All our macros are currently compatible with version 3.6.3

 

5A2-5A3

5A2 IBM SPSS Statistics (Statistical Package for the Social Sciences). Currently using version 27.

5A3 IBM SPSS Amos (Analysis of Moment Structures). Currently using version 27. AMOS is specifically developed for Structural Equation Models (SEM).

5A4-5B1

5A4 Statistics/Data Analysis (STATA)

5A5 Statistical Analysis System (SAS)

5B1 NVivo

And many more as may be required by the project objectives
