
Sample Size

Statistics is the study of collecting, organizing, analyzing, and summarizing data, and of drawing inferences from that data. In statistics, we come across two types of data: population data and sample data. Population data covers the whole area of study, which is termed the population. A population consists of all the elements that are studied for the research.

Sample data, on the other hand, is a part of the population. It is usually cumbersome and difficult to collect and analyze data for the whole population. In that case, a representative sample is selected from the population, and the data gathered from it is termed sample data.

For example, if the study concerns the income of all practicing physicians in the US, then the population includes all physicians in the country. Because of financial and time constraints, studying the entire population is generally not feasible. Hence, a representative sample is formed by randomly choosing elements from the population, and the characteristics of the sample are studied.

These sample statistics are then generalized to describe the population parameters, and inferences are drawn from them. It is very important to know how many elements must be selected from the population in order to get a reliable result. The number of elements in such a sample is called the sample size, denoted by the lowercase letter 'n', while the population size is denoted by the uppercase letter 'N'.


Definition

Sample size is the number of observations used for calculating estimates of a given population; equivalently, it is the number of entities in a subset of a population selected for analysis. Sampling is concerned with the selection of a subset of individuals from within a statistical population in order to estimate characteristics of the whole population. Both the size of the sample and the way in which it has been drawn from the population affect the quality of those estimates.

Small Sample Size

Sometimes the sample size can be very small. When the sample size is small (n < 30) and the population variance is unknown, we use the t distribution in place of the normal distribution, and the t statistic is used to test the null hypothesis in both one-tailed and two-tailed tests.

t = $\frac{\bar X - \mu}{\frac{s}{\sqrt{n}}}$
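As a minimal sketch of the t statistic above, the following Python snippet computes t from raw data using only the standard library. The function name `t_statistic`, the data values and the hypothesized mean are illustrative assumptions, not values from this article.

```python
import math
import statistics


def t_statistic(sample, mu0):
    """One-sample t statistic: (x_bar - mu0) / (s / sqrt(n))."""
    n = len(sample)
    x_bar = statistics.mean(sample)
    s = statistics.stdev(sample)  # sample standard deviation (divides by n - 1)
    return (x_bar - mu0) / (s / math.sqrt(n))


# Hypothetical small sample (n = 8 < 30) and hypothesized population mean
scores = [2, 4, 4, 4, 5, 5, 7, 9]
t_value = t_statistic(scores, mu0=4)
print(round(t_value, 4))  # 1.3229
```

The resulting value would then be compared with the critical value from the t-table with n - 1 degrees of freedom.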

Large Sample Size

A large sample size generates more accurate estimates, but too large a sample might cause difficulties in interpreting the usual tests of significance, and the same problem may arise with a very small sample size. Thus, neither too large nor too small a sample size helps a research project.

Formula

Formula for the infinite population

$SS$ = $\frac{Z^2p(1-p)}{C^2}$

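Where, $SS$ = sample size
$Z$ = Z-value for the chosen confidence level (for example, 1.96 for 95% confidence)
$p$ = proportion of the population having the attribute of interest, expressed as a decimal
$C$ = confidence interval (margin of error), expressed as a decimal.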
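The formula for $SS$ can be tried out with a short Python sketch. It treats $C$ as the margin of error expressed as a decimal, which is an assumption about how the formula is meant to be read; the function name `sample_size_proportion` and the survey values are hypothetical.

```python
import math
from statistics import NormalDist


def sample_size_proportion(confidence, p, c):
    """SS = Z^2 * p * (1 - p) / C^2, rounded up to a whole number."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)  # two-sided Z-value
    return math.ceil(z ** 2 * p * (1 - p) / c ** 2)


# Hypothetical survey: 95% confidence, p = 0.5, C = 0.05
ss = sample_size_proportion(0.95, 0.5, 0.05)
print(ss)  # 385
```

Choosing p = 0.5 maximizes p(1 - p), so it gives the largest (most conservative) sample size for a given Z and C.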

Also, when sample data is collected and the sample mean $\bar x$ is calculated, that sample mean is typically different from the population mean $\mu$. This difference between the sample and population means can be thought of as an error. The maximum error E is the largest difference between the observed sample mean and the true value of the population mean that is expected at the chosen confidence level.

The formula $E = Z_{\frac{\alpha }{2}}(\frac{\sigma }{\sqrt{n}})$ can be solved for n to determine the minimum sample size that must be used in order to assure a given level of confidence and a maximum allowed error.

$n = (\frac{Z_{\frac{\alpha }{2}}\sigma }{E})^{2}$

Solved Example

Question: Assuming the heights of students on a college campus are normally distributed with a standard deviation of 5 in, find the minimum sample size required to construct a 95% confidence interval for the mean with a maximum error of 0.5 in.

Solution:

It is given E = 0.5 in, σ = 5 and α = 1 - 0.95 = 0.05

Hence $Z_{\frac{\alpha }{2}}=Z_{0.025}= 1.96$

$n=(\frac{Z_{\frac{\alpha }{2}}\sigma }{E})^{2}$

$n=(\frac{1.96(5)}{0.5})^{2}= 384.16$

Rounding this value up to the next integer, the minimum sample size required = 385.
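The same calculation can be checked with a small Python sketch using only the standard library. The function name `min_sample_size` is an illustrative assumption. The sketch uses the unrounded Z-value (about 1.95996 instead of 1.96), so the intermediate result is roughly 384.15 rather than 384.16, but the rounded-up answer is the same.

```python
import math
from statistics import NormalDist


def min_sample_size(sigma, max_error, confidence):
    """n = (Z_{alpha/2} * sigma / E)^2, rounded up to the next integer."""
    alpha = 1 - confidence
    z = NormalDist().inv_cdf(1 - alpha / 2)
    return math.ceil((z * sigma / max_error) ** 2)


n_required = min_sample_size(sigma=5, max_error=0.5, confidence=0.95)
print(n_required)  # 385
```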

Determining Sample Size

Determining Sample Size for Research Activities

The sample size needs to be statistically significant, which means it is not chosen arbitrarily or by chance. What, then, are the factors that influence the selection of the sample size? The accuracy level of the estimates that the researcher proposes to present and the error margin he would allow for the estimates determine the sample size. The population size is also taken into account.

The researcher fixes a maximum sample size, often imposed by the amount budgeted for the sampling, while the maximum permissible error and the accuracy the researcher would like to guarantee determine the minimum size required for the sample. An optimal sample size chosen in this way is hence statistically significant. Experiments, surveys and other research studies are much more likely to yield useful results if they are carefully planned and a suitable number of observations is obtained.

Sample Size Determination Table

Given below is the table for determining the sample size for a given population.

Maximum Error in an Estimate

One important aspect of inferential statistics is the estimation of population parameters using sample statistics. For example, a researcher needs to estimate the population mean income of physicians from a representative sample of physicians chosen randomly using their registration identities. For this purpose he can use the sample mean as a point estimate. A better choice is an interval estimate, which is likely to contain the true mean. He determines the maximum error that can be allowed for the estimate and also fixes a confidence level for his estimates. The confidence level tells how confident he is that the interval contains the true population mean.

For example, a confidence level of 90% with an error margin of 4% means that 90% of such surveys will produce estimates that are within 4% of the true value. The confidence level is a percentage assigned before the interval estimate is made.

Using the confidence level and the z-score table (when the population variance is known) or Student's t-table, the maximum error of estimate is calculated. The following table maps the formula to use to what is known about the variable.

| Known characteristic | Maximum error of estimate E |
|---|---|
| Population variance $\sigma^2$ is known | $E=Z_{\frac{\alpha }{2}}(\frac{\sigma }{\sqrt{n}})$ |
| Population variance is unknown and sample size n ≥ 30 | $E=Z_{\frac{\alpha }{2}}(\frac{s}{\sqrt{n}})$ |
| Population variance is unknown and n < 30 | $E=t_{\frac{\alpha }{2}}(\frac{s}{\sqrt{n}})$ |

α in the formulas is equal to 1 minus the confidence level expressed as a decimal.

$Z_{\frac{\alpha }{2}}$ is the z-score corresponding to a total two-tail area of α in the standard normal distribution, and $t_{\frac{\alpha }{2}}$ is the t-score corresponding to a two-tail area of α in the Student's t distribution table.
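As a rough illustration of the z-based formulas for E, the sketch below computes the maximum error of estimate with Python's standard library. The function name `max_error_z` and the input values are hypothetical. The t-based case is not covered, because the standard library does not provide t-distribution quantiles; a t-table or a statistics library would be needed for that.

```python
import math
from statistics import NormalDist


def max_error_z(sd, n, confidence):
    """E = Z_{alpha/2} * sd / sqrt(n), with sd = sigma (known) or s (n >= 30)."""
    alpha = 1 - confidence
    z = NormalDist().inv_cdf(1 - alpha / 2)
    return z * sd / math.sqrt(n)


# Hypothetical values: standard deviation 5, sample size 385, 95% confidence
e = max_error_z(sd=5, n=385, confidence=0.95)
print(round(e, 3))  # 0.499
```

Raising the confidence level or lowering n makes E larger, which matches the formulas.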

Standard Deviation

Standard deviation is the square root of the variance. For a sample, the variance is the sum of the squared differences between each score and the mean of all scores, divided by n - 1.

For a sample the formula for the standard deviation is:

$SD$ = $\sqrt{Variance}$

where, Variance = $\frac{\sum(x_i-\bar x_a)^2}{n-1}$

$SD$ = standard deviation of the sample

$x_i$ = each of the observations

$\bar x_a$ = mean of the observations, and

$n$ = the number of scores.
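The sample standard deviation formula can be sketched in Python and compared with the standard library's `statistics.stdev`, which also divides by n - 1. The function name `sample_sd` and the data values are illustrative assumptions.

```python
import math
import statistics


def sample_sd(values):
    """Square root of sum((x_i - mean)^2) / (n - 1)."""
    n = len(values)
    mean = sum(values) / n
    variance = sum((x - mean) ** 2 for x in values) / (n - 1)
    return math.sqrt(variance)


data = [10, 12, 23, 23, 16, 23, 21, 16]  # hypothetical scores
sd_manual = sample_sd(data)
sd_library = statistics.stdev(data)
print(round(sd_manual, 3), round(sd_library, 3))  # 5.237 5.237
```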

Distribution of Samples

If many random samples of a specific size are taken from a population with replacement, then the means computed for these samples form a sampling distribution of sample means.

The central limit theorem gives the characteristics and shape of such a distribution of samples.

Central Limit Theorem: As the sample size n increases without bound, the shape of the distribution of sample means taken with replacement from a population with mean μ and standard deviation σ will approach a normal distribution. The mean of this distribution = μ and its standard deviation = $\frac{\sigma }{\sqrt{n}}$.

The standard deviation of the distribution of samples depends on the sample size 'n' taken for the distribution and, for n > 1, is less than the standard deviation of the population.

As the sample size increases, the standard deviation of the distribution of samples will decrease.
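This shrinking spread can be illustrated with a small simulation. The sketch below is only an illustration under stated assumptions: it draws repeated samples from an exponential population with rate 1 (chosen arbitrarily because it is clearly non-normal and has mean 1 and standard deviation 1) and compares the observed standard deviation of the sample means with $\frac{\sigma }{\sqrt{n}}$. The function name, seed and replicate count are hypothetical choices.

```python
import math
import random
import statistics

random.seed(0)  # fixed seed so the run is repeatable


def sd_of_sample_means(n, replicates=2000):
    """Empirical standard deviation of the means of many samples of size n."""
    means = [
        statistics.fmean([random.expovariate(1.0) for _ in range(n)])
        for _ in range(replicates)
    ]
    return statistics.stdev(means)


# Exponential population with rate 1: mean = 1 and standard deviation = 1
sigma = 1.0
results = {n: sd_of_sample_means(n) for n in (4, 16, 64)}
for n, observed in results.items():
    # sample size, observed spread of the means, theoretical sigma / sqrt(n)
    print(n, round(observed, 3), round(sigma / math.sqrt(n), 3))
```

The observed values should fall close to 0.5, 0.25 and 0.125, getting smaller as n grows.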

The theorem tells us that, irrespective of the shape of the population distribution, the distribution of sample means is approximately normal for large samples. The z-scores for various $\overline{x}$ values can be computed using the formula

$z = \frac{\overline{x}-\mu }{\frac{\sigma }{\sqrt{n}}}$

If the population variance is not known, the sample variance can be computed and used as a point estimate of the population variance when the sample size n ≥ 30. In such instances the formula is

$z = \frac{\overline{x}-\mu }{\frac{s}{\sqrt{n}}}$

The expression $\frac{s}{\sqrt{n}}$ is known as the standard error of the distribution of sample means, and we can see that it depends on the sample size n.

Solved Example

Question: Repeated samples of size 49 are drawn from a population with mean = 40 and standard deviation = 15.  Find the probability that a sample picked from this distribution will have a mean less than 35.

Solution:

By central limit theorem, the distribution of samples is approximately normally distributed with mean μ = 40 and

Standard deviation = $\frac{\sigma }{\sqrt{n}}=\frac{15}{\sqrt{49}}=\frac{15}{7}$

The z score for $\overline{x}=35$ can be found using the formula

$z=\frac{\overline{x}-\mu }{\frac{\sigma }{\sqrt{n}}}=\frac{35-40}{\frac{15}{7}}=-2.33$

P($\overline{x}$ < 35)
=P(z < -2.33)
=0.0099

The probability that the sample mean is less than 35 is 0.0099.
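The solved example can be reproduced with Python's `statistics.NormalDist`. The variable names are illustrative. Because the code uses the unrounded z-score (about -2.3333) instead of the table value for -2.33, it gives approximately 0.0098 rather than 0.0099.

```python
from statistics import NormalDist

mu, sigma, n = 40, 15, 49
standard_error = sigma / n ** 0.5  # 15 / 7

# Sampling distribution of the mean under the central limit theorem
sampling_dist = NormalDist(mu, standard_error)

z = (35 - mu) / standard_error
p = sampling_dist.cdf(35)  # P(sample mean < 35)
print(round(z, 2), round(p, 4))  # -2.33 0.0098
```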

Power Analysis

Power analysis is used to determine the sample size needed to achieve a desired level of power. The sample size is related to the power, the size of the type I error, the actual size of the effect, and the size of the experimental error. Given a significance criterion and a desired level of power, it is straightforward to solve for the sample size needed.
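To make this concrete, here is a minimal sketch that solves for the sample size in one common setting. It assumes a two-sided one-sample z-test with known standard deviation and uses the usual normal-approximation formula $n = (\frac{(Z_{\frac{\alpha }{2}} + Z_{\beta })\sigma }{\delta })^{2}$, where $\delta$ is the effect to detect and 1 - β is the power. This formula is a standard textbook result, not one given in this article, and the function name and study values are hypothetical.

```python
import math
from statistics import NormalDist


def sample_size_for_power(effect, sigma, alpha=0.05, power=0.80):
    """n = ((z_{1-alpha/2} + z_{power}) * sigma / effect)^2, rounded up."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # significance criterion
    z_power = NormalDist().inv_cdf(power)          # desired power
    return math.ceil(((z_alpha + z_power) * sigma / effect) ** 2)


# Hypothetical study: detect a shift of 5 units when sigma = 15
n_power = sample_size_for_power(effect=5, sigma=15)
print(n_power)  # 71
```

A smaller effect, a larger experimental error, a stricter significance level or a higher desired power all increase the required sample size.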

Logistic Regression

Logistic regression is a statistical method for analyzing a data set in which one or more independent variables determine an outcome. The sample size for a simple logistic regression can be calculated from the formula for a two-sample t-test.

The sample size for logistic regression can be calculated with formulas for the following cases:

• Single dichotomous predictor
• Single quantitative predictor and
• Multiple predictors
