1 aIc and amIcomp implement simple test functions to determine if a particular dataset and transformation exhibit data pathologies that we can understand and control for. Since ‘all models are wrong’[Box:1976] it is important to know at the outset what is wrong, why it is wrong, and how that wrongness would hinder general interpretability in some way.

To answer the question, posed by the name of the package, “am I compositional?” the answer is always yes! However, a more nuanced answer is that the effect of compositional limitations on the analysis can be large or small depending on the dataset, the normalization and the analysis conducted.

Data normalization is an accepted preprocessing step used to provide desirable statistical characteristics so that the data can be analyzed using standard approaches. These normalizations are usually chosen to address the mean-variance relationship so that differential abundance or linear models can be used to assess how the features differ between groups. However, the effect on other properties of the data are usually not explored when the transformed data are used for dimension reduction and ordination, clustering, feature selection and network analysis. All of these downstream methods assume that distances and correlations between features (anything that has a unique molecular identifier such as a gene, protein, function, operational taxonomic unit, etc) is relatively unaffected by the transformation, although this is generally not examined.

Pathologies in data have knock-on effects for analysis and that one should be aware of this prior to starting an analysis. If the data are behaving as true counts, then our inferences should not change with these alterations. However, it was recognized early on that ‘normalization’ was a key aspect of reproducible analyses. This led to a profusion of normalizations being used during the analysis of high throughput sequencing [Dillies:2013] and several attempts to unify normalization across separate domains of research that all used the same technology platforms [McMurdie:2014, Fernandes:2014]. The realization that HTS instruments delivered multivariate count-compositional data [Friedman:2012,fernandes:2013,Lovell:2015] suggested a way forward. However, there is still considerable uncertainty about the best approaches since non-compositional approaches often appear to give reasonable answers in some contexts.

Therefore, the purpose of this tool is to allow the user to investigate if a particular combination of normalization, dataset and analysis will allow robust interpretations for analyses other than differential abundance or if a different normalization would be better.

One final complication is the number and distribution of 0 values in the dataset. This depends on the makeup of the original samples, and on the sequencing depth. If the groups of samples have different background distributions (say different sets of genes expressed in each group, or different bacterial species in each group), then there will be an asymmetric 0 distribution. If the samples are sequenced too shallowly, then features with low values will drop out and be represented as 0 values. If an asymmetric dataset is sequenced shallowly, then both pathologies will occur. The effect of adjusting the compositional basis (denominator) of the datasets can be explored using the iqlr and lvha approaches that attempt to identify unvarying denominators[Wu2021].

1.1 List of common data pathologies:

  1. Singular covariance matrix: This imposes analysis constraints on the data especially for ordination and clustering. In practice almost all HTS datasets are singular because they are high dimensional: there are more samples than features, although single-cell transcriptome analysis has a chance of being non-singular after filtering. However, even non-singular starting matrices are converted to singular matrices by many common transforms. This property should be assumed for all datasets, but is included for completeness.

  2. Scale invariance: In the context of high throughput sequencing (HTS), the same library can be sequenced on instruments that give vastly different numbers of output reads (from 1M to \(\ge\) 2B reads), and there can be unanticipated events that occur during sample processing that result in batch effects, etc. This could result in datasets with vastly different numbers of reads per sample and the associated gain or loss of counts in low-count features (censoring), or systemic deviations between features. This ‘library size’ problem is one of the main reasons for data normalization, as without it, ordination often shows that the major contributor to the variance in the data is the number of reads per sample rather than the biological effect being examined. This property means that changing the scale of the data — multiplying or dividing by and arbitrary value — should have no or only a minimal effect on the distances between samples.

  3. Sub-compositional distance dominance: This property means that any subset of the data should have smaller distances between samples than the full dataset (or any superset of the subset). This property should be met by any compositionally appropriate transform which converts features to be linearly different (as the centred log-ratio transform does).

  4. Perturbation invariance: This property means that simple perturbations of the data or a subset of the data should have minimal effects on the outcome. In particular, perturbations should have minimal effects on distances between samples. This is because in a composition, a perturbation, a multiplicative change in the feature, simply moves the dataset from one place to another in the sample space.

  5. Correlation coherence: This means that correlations between features are not identical for the full dataset and any feature subset of that dataset. This is not preserved for any transformation tested here and these papers will get you started1. Sub-compositional correlation coherence is possible with isometric log-ratios but these are not instantiated in differential abundance packages.

1.2 Properties of datasets

High throughput sequencing datasets share study-design specific properties, as shown in Figure 1 which plots the relative abundance vs. dispersion for several different types of datasets. We can see that transcriptome datasets, whether bulk or single-cell have a tight relationship, and that all other types of data have a less predictable type of relationship. Note that the in-vitro selection dataset (selex) has a nearly independent relationship between these two parameters. For this reason, it can be difficult to predict how different normalizations will behave and hence the need for this tool.

2 Example functions

The selex dataset2 is a useful example because we have a standard of truth, and because it behaves so badly with so many tools since it does not have a predictable relative abundance-variance relationship. It is the test dataset in the ALDEx2 R package3.

2.1 Sub-compositions and correlation (coherence)

The first test is for correlation in a full dataset and a subset of the data and if different the data are compostions and need to be treated as such. This is a correlation coherence test. Correlations in compositional data are not stable to subsetting and any transformation that is not coherent should not be trusted to build correlation networks unless one is using compositionally appropriate methods.

As a simple example, let’s try the selex dataset with the clr and log of the edgeR TMM transformation. Here the expected result is that the features are not sub-compostionally coherent under the test as shown by Lovell in 2015.

In other words, correlations change as different filtering and pre-processing steps are performed on the data. If this happens in your data with your transformation do not use correlation, and use a compositionally appropriate measure of association.

test.cor.clr <- aIc.coherent(selex, group=group, zero.method='prior', 
  norm.method='clr', log=F, cor.test='spearman')
## computing center with all features
## computing center with all features
test.cor.TMM <- aIc.coherent(selex, group=group, zero.method='prior', 
  norm.method='TMM', log=F, cor.test='spearman')
## [1] "No"
## [1] "No"
# plot the results
## Warning in plot.window(...): "ces" is not a graphical parameter
## Warning in plot.xy(xy, type, ...): "ces" is not a graphical parameter
## Warning in axis(side = side, at = at, labels = labels, ...): "ces" is not a
## graphical parameter

## Warning in axis(side = side, at = at, labels = labels, ...): "ces" is not a
## graphical parameter
## Warning in box(...): "ces" is not a graphical parameter
## Warning in title(...): "ces" is not a graphical parameter