# Interpreting Scatterplots in Genotyping Experiments

## Minor allele frequency and sample size in interpreting genotyping scatterplots

If you are new to single nucleotide polymorphism (SNP) genotyping experiments, you may be struggling with how to interpret your data. The allelic discrimination (AD) scatterplots you generate should discriminate between homozygotes and heterozygotes for your SNP, clearly illustrating the prevalence and distribution of the SNP in your sample set. But what if you aren’t sure that your data are correct or the plot doesn't look as you expected?

In this article we will cover some basic concepts that will help you to assess your data and identify any errors. These concepts include sample size, minor allele frequency (MAF), and the Hardy-Weinberg equilibrium. If your data appear incomplete, or if you aren’t sure how to interpret AD plots, read on for some helpful examples.

We will also help you to calculate an appropriate sample size before you begin your experiment, so that you have a realistic chance of detecting a SNP with a low minor allele frequency in your target population. The MAF is the frequency of the less common allele of a SNP in a given population. Put simply, it is the fraction of all copies of the SNP site in that population (two copies per individual for an autosomal SNP) that carry the less frequent (minor) allele, with the remaining copies carrying the major allele.

Let’s begin by examining a typical AD scatterplot generated from a standard SNP genotyping experiment (Figure 1). Each dot on the plot represents one sample, derived from one individual and one assay in a single well. If the SNP of interest has a high MAF, the plot should show three distinct clusters from your samples—one for minor allele homozygotes, one for major allele homozygotes, and one for heterozygotes. The plot should also include a cluster of no-template controls (NTCs) closer to the origin.

Note that Allele 1 does not always correspond to the major allele. You must verify which allele each dye reports by checking the assay’s context sequence, which is available in our TaqMan assay search tool and in the Assay Information File supplied with your order. The context sequence is the nucleotide sequence surrounding the SNP site, given in the (+) genome strand orientation relative to the NCBI reference genome. The SNP alleles appear in brackets, and their order corresponds to the probe reporter dyes: [Allele 1 = VIC™ dye / Allele 2 = FAM™ dye]. Context sequences in publicly available databases do not follow this convention and can map to either genomic strand. This matters most for A/T and C/G SNPs, where the two alleles are complementary to each other, so the alleles alone cannot tell you which strand a sequence is reported on.
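The dye-to-allele convention described above can be sketched in a few lines of Python. This is a minimal illustration, not a parser for any official file format: the slash between the bracketed alleles, the example sequences, and the function name `alleles_by_dye` are all assumptions made for the example.

```python
import re

# Assumed layout: flanking bases with the two alleles in square brackets,
# separated by a slash, e.g. "GATTACA[C/T]GGCATTA". The slash separator is
# an assumption for this sketch.
CONTEXT_PATTERN = re.compile(r"\[([ACGT])/([ACGT])\]", re.IGNORECASE)

COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def alleles_by_dye(context_sequence):
    """Return the allele reported by each dye for one context sequence."""
    match = CONTEXT_PATTERN.search(context_sequence)
    if match is None:
        raise ValueError("no [X/Y] SNP site found in context sequence")
    allele_1, allele_2 = (base.upper() for base in match.groups())
    return {
        "VIC": allele_1,  # first bracketed allele = Allele 1
        "FAM": allele_2,  # second bracketed allele = Allele 2
        # For A/T and C/G SNPs the two alleles are complementary, so the
        # same pair of letters appears on both strands.
        "strand_ambiguous": COMPLEMENT[allele_1] == allele_2,
    }


print(alleles_by_dye("GATTACA[C/T]GGCATTA"))  # hypothetical sequence
print(alleles_by_dye("GATTACA[A/T]GGCATTA"))  # hypothetical A/T SNP
```

The `strand_ambiguous` flag only marks SNPs that need extra care when comparing against a public database. It does not resolve the strand for you.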

In our sample scatterplot, the minor allele is Allele 2 and the major allele is Allele 1. Allele 1 homozygotes (Allele 1/Allele 1) cluster in red in the bottom right corner, Allele 2 homozygotes (Allele 2/Allele 2) cluster in dark blue in the top left corner, and heterozygotes (Allele 1/Allele 2) cluster in green in the middle of the plot. NTCs are shown in light blue and cluster near the origin.

The scatterplot can help us discriminate between these groups based on the clustering. The clusters form because the SNP alleles are detected using different fluorescent probes. In our sample scatterplot, Allele 2 has been labeled using FAM dye and Allele 1 has been labeled using VIC dye. The signal from the VIC dye is displayed on the X axis and the signal from the FAM dye is displayed on the Y axis. Homozygotes produce signal predominantly from one dye, whereas heterozygotes produce signal from both, which gives rise to the distinct clusters in the plot. In this example, there are 40 experimental samples in total, including eighteen Allele 1 homozygotes, six Allele 2 homozygotes, and sixteen heterozygotes.

If an AD scatterplot is to discriminate properly between the alleles, the clusters should not be too close together or 'bleeding into' each other, and the points in each cluster should group closely together. Your scatterplot should include NTCs which should cluster in the bottom left of the plot and can be used to orient your sample clusters to an origin.
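To make the geometry of the plot concrete, here is a deliberately naive sketch that assigns a genotype from a point's position. It is not how genotyping analysis software makes calls. The function name and every threshold (`ntc_radius`, `low_deg`, `high_deg`) are hypothetical placeholders, and the signals are assumed to be normalized to comparable scales.

```python
import math


def call_genotype(vic, fam, ntc_radius=0.5, low_deg=30.0, high_deg=60.0):
    """Naive genotype call from VIC (X axis) and FAM (Y axis) signals.

    All thresholds are hypothetical values chosen for illustration only.
    """
    # Points with little signal from either dye sit near the origin,
    # where the no-template controls are expected.
    if math.hypot(vic, fam) < ntc_radius:
        return "NTC / no amplification"
    # Angle from the X (VIC) axis: 0 degrees = VIC only, 90 = FAM only.
    angle = math.degrees(math.atan2(fam, vic))
    if angle < low_deg:
        return "Allele 1 homozygote"  # mostly VIC signal, bottom right
    if angle > high_deg:
        return "Allele 2 homozygote"  # mostly FAM signal, top left
    return "Heterozygote"             # both dyes, middle of the plot


for point in [(3.0, 0.3), (0.3, 3.0), (2.0, 2.0), (0.1, 0.1)]:
    print(point, "->", call_genotype(*point))
```

Fixed angle cut-offs like these break down when clusters sit close together or bleed into each other, which is why well-separated clusters matter.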

Assuming you are sampling a large, randomly mating population with no natural selection, migration, or new mutations at the site, and the SNP resides on an autosome, SNP genotyping data should follow the Hardy-Weinberg equilibrium. This is an equation that relates the allele frequencies in a population to the expected genotype frequencies, and therefore to the relative sizes of the clusters in an AD scatterplot. See Figure 1 for a dataset with three distinct clusters that is consistent with the Hardy-Weinberg equilibrium.

If there are only one or two clusters on your scatterplot, your sample may be too small, given the MAF of the SNP in the target population, to include any minor allele homozygotes.

Let's explain. The MAF is linked to the number of samples you will need: the smaller the MAF, the more samples you will need to observe enough minor allele homozygotes for meaningful results. So, how can you figure out whether your sample size was too small or, even better, calculate the sample size needed before you begin the experiment?

First, we need some information about the SNP we are investigating, specifically its MAF. The MAF for specific SNPs is frequently available on our website. Our TaqMan assay search tool provides MAF data from two public sources (the 1000 Genomes Project and the HapMap project) and Applied Biosystems data for validated TaqMan SNP and DME Genotyping assays, which were tested on up to 180 samples from African-American, Caucasian, Chinese, and Japanese populations.

You may also be able to find MAF information on publicly available websites, such as the NCBI’s dbSNP database. Our TaqMan assay search tool provides links from assay target SNPs to NCBI SNP web pages.

Let’s work through a sample calculation based on the SNP analyzed in the dataset above, which has a MAF of 0.39. This means that 39% of the copies of this SNP site in the target population are expected to carry the minor allele and 61% are expected to carry the major allele.

The Hardy-Weinberg equilibrium equation is as follows:

q^2 + 2qp + p^2 = 1

where:

q = minor allele frequency fraction = 0.39

p = major allele frequency fraction = 0.61

Each term in the equation corresponds to an expected genotype frequency for the given population:

q^2 = proportion of homozygotes for the minor allele (qq)

2qp = proportion of heterozygotes (qp)

p^2 = proportion of homozygotes for the major allele (pp)

For our example, the predicted frequencies are:

q^2 = 0.39 x 0.39 = 0.15

2qp = 2 x 0.39 x 0.61 = 0.48

p^2 = 0.61 x 0.61 = 0.37
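The same arithmetic can be written as a small Python helper, which makes it easy to repeat for other SNPs. The function name and dictionary keys are illustrative choices, not part of any assay software.

```python
def hardy_weinberg_frequencies(maf):
    """Expected genotype frequencies for a biallelic autosomal SNP."""
    if not 0.0 <= maf <= 0.5:
        raise ValueError("a minor allele frequency must lie between 0 and 0.5")
    q = maf        # minor allele frequency
    p = 1.0 - q    # major allele frequency
    return {
        "minor_homozygote": q * q,   # q^2
        "heterozygote": 2 * q * p,   # 2qp
        "major_homozygote": p * p,   # p^2
    }


for genotype, freq in hardy_weinberg_frequencies(0.39).items():
    print(f"{genotype}: {freq:.2f}")
# minor_homozygote: 0.15
# heterozygote: 0.48
# major_homozygote: 0.37
```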

Our example includes data from 40 samples. Let’s use the predicted frequencies above to check whether the results of the experiment adhere to the Hardy-Weinberg equilibrium. Given that we analyzed 40 samples, we would expect to find the following:

q^2: 40 samples x 0.15 = 6 individuals

2qp: 40 samples x 0.48 = 19.2 individuals

p^2: 40 samples x 0.37 = 14.8 individuals

This yields expected figures of 6, 19, and 15 individuals and is broadly in agreement with the eighteen Allele 1 homozygotes, sixteen heterozygotes and six Allele 2 homozygotes displayed in the scatterplot. Therefore, these data broadly adhere to the Hardy-Weinberg equilibrium.
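If you want something more formal than a visual comparison, one common option is a chi-square goodness-of-fit test. The sketch below goes beyond the comparison made in this article: it estimates the allele frequency from the observed genotype counts rather than using the database MAF, and compares the statistic with 3.841, the 5% critical value for one degree of freedom. The function name is illustrative.

```python
def hwe_chi_square(n_major_hom, n_het, n_minor_hom):
    """Chi-square statistic for Hardy-Weinberg proportions (1 d.f.)."""
    n = n_major_hom + n_het + n_minor_hom
    # Each minor homozygote carries two minor alleles, each heterozygote one.
    q = (2 * n_minor_hom + n_het) / (2 * n)
    if q <= 0.0 or q >= 1.0:
        raise ValueError("both alleles must be observed to run the test")
    p = 1.0 - q
    observed = (n_major_hom, n_het, n_minor_hom)
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    return q, expected, chi2


# Counts from the example scatterplot: 18, 16, and 6 samples.
q, expected, chi2 = hwe_chi_square(18, 16, 6)
print(f"sample MAF = {q:.2f}")        # 0.35
print(f"chi-square = {chi2:.2f}")     # about 0.58
print("consistent with HWE at the 5% level:", chi2 < 3.841)
```

The sample's own minor allele frequency works out to 0.35, slightly below the database value of 0.39, which is unremarkable for 40 samples. With expected counts this small, treat the statistic as a rough guide only.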

You can also use these predicted frequencies to assess the probable outcome of using a particular sample size.

If we were to test samples from just five individuals then we would expect to detect the following:

q^2 = 5 x 0.15 = 0.75 individuals = approximately 1 individual

2qp = 5 x 0.48 = 2.4 individuals = approximately 2 individuals

p^2 = 5 x 0.37 = 1.85 individuals = approximately 2 individuals

In this instance we may not observe three clusters in the final scatterplot. Unsurprisingly, our sample of five may be too small to detect even one minor allele homozygote.
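An expected count below one is easier to interpret as a probability. The sketch below assumes that samples are unrelated individuals drawn from a population in Hardy-Weinberg equilibrium. That assumption and the function name are additions made for this illustration.

```python
def prob_at_least_one_minor_homozygote(maf, n_samples):
    """Chance that a sample contains one or more minor allele homozygotes."""
    # Each individual is a minor homozygote with probability q^2, so the
    # chance that none of n independent individuals is one is (1 - q^2)^n.
    return 1.0 - (1.0 - maf ** 2) ** n_samples


for n in (5, 40):
    prob = prob_at_least_one_minor_homozygote(0.39, n)
    print(f"{n} samples: {prob:.0%} chance of at least one minor homozygote")
```

For a MAF of 0.39, five samples give only slightly better than even odds of including a minor allele homozygote, whereas 40 samples make it almost certain.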

But we can easily calculate the minimum sample size required to detect a specific minor allele in a target population. You can use the following formula to calculate the number of samples required to detect one homozygote for the minor allele, allowing you to plan your sample size ahead of your experiment.

Minimum sample size = 1 ÷ MAF^2

Let’s work this out for our sample SNP (MAF = 0.39)

Minimum Sample size = 1 ÷ (0.39 x 0.39)

Sample size = 1 ÷ 0.15

Sample size = 6.67

Therefore, for this SNP, we would need a sample of approximately 7 individuals from the population to expect one homozygote for the minor allele. This is the sample size at which one minor allele homozygote is expected on average, not a guarantee of finding one, so your sample should be larger than this to achieve a cluster of minor allele homozygote data points. The scatterplot in Figure 1 uses data from 40 samples, so there were enough samples to detect minor allele homozygotes and generate three distinct clusters in the scatterplot.

You should note, however, that the MAF of 0.39 used in this example is relatively high. Much larger samples are needed for SNPs with a very low MAF. A SNP with a MAF of 0.05 will require 400 samples to detect just one homozygote for the minor allele—try working this out for yourself using the formula above (sample size = 1 ÷ q^2).
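The formula is simple to script for planning purposes. The first function below implements the 1 ÷ q^2 rule from this article, rounded up to a whole sample. The second is an optional extension that is not part of the article's formula: it asks how many samples are needed for a chosen probability of seeing at least one minor allele homozygote, assuming unrelated individuals from a population in Hardy-Weinberg equilibrium. Both function names are illustrative.

```python
import math


def min_samples_expected_one(maf):
    """Samples needed for an expected count of one minor homozygote."""
    # round() guards against floating-point noise such as 399.99999999999994.
    return math.ceil(round(1.0 / maf ** 2, 9))


def min_samples_for_confidence(maf, confidence=0.95):
    """Samples needed to see one or more minor homozygotes with the given
    probability (an extension beyond the 1 / q^2 rule)."""
    # Solve 1 - (1 - q^2)^n >= confidence for n.
    return math.ceil(math.log(1.0 - confidence) / math.log(1.0 - maf ** 2))


for maf in (0.39, 0.05):
    print(
        f"MAF {maf}: {min_samples_expected_one(maf)} samples for an expected "
        f"count of one, {min_samples_for_confidence(maf)} for a 95% chance"
    )
```

The gap between the two numbers shows why the 1 ÷ q^2 figure is best treated as a floor rather than a target.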

If the required sample size for your SNP of interest is prohibitive for detecting all three genotypes, you can run control samples that will make it easier for your analysis software to correctly appraise your data. These include genetic reference materials, such as samples known to be heterozygous or homozygous for a particular allele, or even plasmid controls with the appropriate sequence.

This article has shown the importance of taking MAF and sample size into account when designing a genotyping experiment or when investigating why your scatterplot may not display three clusters. There are several other reasons why your scatterplot may not reveal three clusters, such as the SNP of interest being on an X or Y chromosome or lying in a gene that displays copy number variation. Sample quality may also affect the clustering. For additional troubleshooting advice, consult Appendix A in the TaqMan SNP Genotyping Assays User Guide.

For Research Use Only. Not for use in diagnostic procedures.