How to Use the Hypergeometric Distribution Formula for Sampling Without Replacement
The hypergeometric distribution is a fundamental discrete probability distribution used in statistics to calculate the likelihood of a specific number of successes in a series of draws from a finite population. Unlike the binomial distribution, which assumes each trial is independent, the hypergeometric distribution is specifically designed for scenarios involving sampling without replacement. This means that each draw changes the composition of the remaining population, making the trials dependent on one another.
Understanding the Hypergeometric Distribution Formula
To calculate the probability of obtaining exactly $x$ successes in a sample of size $n$, the hypergeometric distribution formula is expressed as:
$$P(X = x) = \frac{\binom{K}{x} \binom{N-K}{n-x}}{\binom{N}{n}}$$
In this formula, the notation $\binom{a}{b}$ represents the binomial coefficient, often read as "$a$ choose $b$," which is calculated as:
$$\binom{a}{b} = \frac{a!}{b!(a-b)!}$$
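As a quick illustration, the factorial definition above can be written directly in Python and checked against the standard library's `math.comb`. The function name `binomial_coefficient` is only an illustrative choice.

```python
from math import comb, factorial

def binomial_coefficient(a: int, b: int) -> int:
    """Return "a choose b" using the factorial definition."""
    if b < 0 or b > a:
        return 0  # there is no way to choose more items than exist
    # Integer division is exact because b!(a-b)! always divides a!.
    return factorial(a) // (factorial(b) * factorial(a - b))

print(binomial_coefficient(5, 2))                 # 10
print(binomial_coefficient(5, 2) == comb(5, 2))   # True
```

In practice `math.comb` is the better choice, since it avoids building the full factorials.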
Variable Definitions
To apply the formula correctly, it is essential to understand what each variable represents within the context of your data:
- $N$ (Population Size): The total number of items in the entire group or population being studied.
- $K$ (Number of Successes in Population): The total number of items in the population that possess the specific characteristic you are looking for (the "success" states).
- $n$ (Sample Size): The number of items drawn from the population for the current observation or experiment.
- $x$ (Number of Observed Successes): The specific number of items in your sample that actually possess the desired characteristic.
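With the four variables defined, the formula can be sketched as a small Python function. This is a minimal version, assuming integer inputs with $0 \le K \le N$ and $0 \le n \le N$; the name `hypergeom_pmf` is an illustrative choice, not a reference to any particular library.

```python
from math import comb

def hypergeom_pmf(x: int, N: int, K: int, n: int) -> float:
    """P(X = x): exactly x successes in n draws without replacement."""
    if not (0 <= K <= N and 0 <= n <= N):
        raise ValueError("require 0 <= K <= N and 0 <= n <= N")
    # Values of x that cannot occur have probability zero.
    if x < max(0, n - (N - K)) or x > min(n, K):
        return 0.0
    # Favorable samples divided by all possible samples.
    return comb(K, x) * comb(N - K, n - x) / comb(N, n)

# Small check: 10 items, 4 successes, draw 3, exactly 1 success.
print(hypergeom_pmf(1, N=10, K=4, n=3))  # 0.5
```

Summing the function over every possible $x$ should give 1, which is a useful sanity check on the inputs.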
The Mathematical Logic Behind the Formula
The hypergeometric formula rests on simple combinatorial logic: it is a ratio of "favorable outcomes" to "total possible outcomes." This works because, under random sampling, every possible sample of size $n$ is equally likely.
1. The Numerator: Counting Favorable Combinations
The numerator is composed of two parts:
- $\binom{K}{x}$: This counts how many different ways you can select exactly $x$ successful items from the total $K$ successes available in the population.
- $\binom{N-K}{n-x}$: This counts the number of ways to select the remaining items in your sample ($n - x$) from the "failure" group (the items that are not successes, which total $N - K$).
By multiplying these two combinations, you determine the total number of distinct ways a sample of size $n$ can be formed containing exactly $x$ successes.
2. The Denominator: Counting All Possible Combinations
The denominator, $\binom{N}{n}$, represents the total number of ways to choose any sample of size $n$ from the entire population of $N$ items, without regard to whether they are successes or failures. Dividing the numerator by this value gives the exact probability of observing exactly $x$ successes.
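The counting argument can be checked by brute force on a small population. The sketch below lists every possible sample and counts the favorable ones; the population values ($N = 8$, $K = 3$, $n = 4$, $x = 2$) are hypothetical and chosen only to keep the enumeration small.

```python
from itertools import combinations
from math import comb

def count_samples(N: int, K: int, n: int, x: int) -> tuple[int, int]:
    """Return (favorable, total) by listing every possible sample."""
    # Items are labeled 0..N-1; labels below K are the "successes".
    favorable = 0
    total = 0
    for sample in combinations(range(N), n):
        total += 1
        successes_in_sample = sum(1 for item in sample if item < K)
        if successes_in_sample == x:
            favorable += 1
    return favorable, total

favorable, total = count_samples(N=8, K=3, n=4, x=2)
print(favorable, total)                      # 30 70
print(favorable == comb(3, 2) * comb(5, 2))  # True: numerator of the formula
print(total == comb(8, 4))                   # True: denominator of the formula
```

Enumeration grows very quickly with $N$ and $n$, so it is only practical as a check; the closed-form formula is what makes larger problems tractable.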
When to Use the Hypergeometric Distribution
Statistical models only work when their underlying assumptions are met. For a scenario to be modeled using the hypergeometric distribution, the following four conditions must be true:
- The population is finite: You must know the exact total number of items ($N$).
- Two possible outcomes: Each item must be clearly categorized as either a "success" or a "failure" (e.g., defective vs. non-defective, red vs. black).
- No replacement: Once an item is selected from the population, it is not returned. This is the defining characteristic that separates it from the binomial distribution.
- Dependent trials: Because items are not replaced, the probability of drawing a success changes with every subsequent draw.
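The dependence between draws can be made concrete with exact fractions. The sketch below assumes a hypothetical box of 50 items with 5 successes and looks at the first two draws; `p_success` is an illustrative helper name.

```python
from fractions import Fraction

def p_success(successes_left: int, items_left: int) -> Fraction:
    """Probability that the next draw is a success."""
    return Fraction(successes_left, items_left)

p_first = p_success(5, 50)                 # nothing drawn yet
p_second_given_success = p_success(4, 49)  # one success already removed
p_second_given_failure = p_success(5, 49)  # one failure already removed

# Averaging over both possibilities for the first draw.
p_second = (p_first * p_second_given_success
            + (1 - p_first) * p_second_given_failure)

print(p_first, p_second_given_success, p_second_given_failure, p_second)
# 1/10 4/49 5/49 1/10
```

The conditional probabilities differ (4/49 versus 5/49), which is the dependence the hypergeometric model accounts for, even though the unconditional chance of a success on any single draw remains $K/N$.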
Hypergeometric vs. Binomial Distribution: Key Differences
One of the most common points of confusion in probability theory is deciding between the hypergeometric and binomial formulas. The choice hinges entirely on the concept of replacement.
| Feature | Hypergeometric Distribution | Binomial Distribution |
|---|---|---|
| Sampling Method | Without replacement | With replacement |
| Independence | Trials are dependent | Trials are independent |
| Probability of Success | Changes with each draw | Constant for every trial |
| Population Size | Must be finite and known | Can be infinite or unknown |
Practical Tip: If the population size $N$ is large compared to the sample size $n$ (a common rule of thumb is $n \le 0.05N$), removing an item barely changes the probability of success on later draws. In such cases, the binomial distribution with $p = K/N$ is often used as a simpler approximation of the hypergeometric distribution.
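To see how the approximation behaves, the sketch below compares the two probability mass functions for a small and a large population with the same success fraction. The parameter values and the helper `max_gap` are hypothetical choices for illustration.

```python
from math import comb

def hypergeom_pmf(x: int, N: int, K: int, n: int) -> float:
    if x < max(0, n - (N - K)) or x > min(n, K):
        return 0.0
    return comb(K, x) * comb(N - K, n - x) / comb(N, n)

def binom_pmf(x: int, n: int, p: float) -> float:
    return comb(n, x) * p**x * (1 - p) ** (n - x)

def max_gap(N: int, K: int, n: int) -> float:
    """Largest absolute difference between the two PMFs over x = 0..n."""
    p = K / N
    return max(abs(hypergeom_pmf(x, N, K, n) - binom_pmf(x, n, p))
               for x in range(n + 1))

small_population_gap = max_gap(N=50, K=5, n=10)          # n/N = 0.20
large_population_gap = max_gap(N=10_000, K=1_000, n=10)  # n/N = 0.001

print(f"small population: {small_population_gap:.4f}")
print(f"large population: {large_population_gap:.6f}")
```

With the same success fraction of 10%, the gap shrinks sharply once the sample is a tiny share of the population.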
Step-by-Step Calculation Example
Suppose a quality control inspector is examining a box of 50 electronic components. It is known that 5 of these components are defective. If the inspector randomly selects 10 components without replacement, what is the probability that exactly 2 of them are defective?
Identify the Variables:
- $N = 50$ (Total components)
- $K = 5$ (Total defective components)
- $n = 10$ (Number of components sampled)
- $x = 2$ (Target number of defectives in the sample)
Set Up the Formula:
$$P(X = 2) = \frac{\binom{5}{2} \binom{50-5}{10-2}}{\binom{50}{10}} = \frac{\binom{5}{2} \binom{45}{8}}{\binom{50}{10}}$$
Calculate the Combinations:
- $\binom{5}{2} = \frac{5!}{2!\,3!} = 10$
- $\binom{45}{8} = \frac{45!}{8!\,37!} = 215{,}553{,}195$
- $\binom{50}{10} = \frac{50!}{10!\,40!} = 10{,}272{,}278{,}170$
Solve:
$$P(X = 2) = \frac{10 \times 215{,}553{,}195}{10{,}272{,}278{,}170} \approx 0.2098$$
There is approximately a 20.98% chance that the inspector will find exactly 2 defective components in a sample of 10.
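The same calculation can be reproduced in a few lines of Python using `math.comb` from the standard library. The variable names are illustrative.

```python
from math import comb

N, K, n, x = 50, 5, 10, 2

ways_defective = comb(K, x)          # choose 2 of the 5 defectives
ways_good = comb(N - K, n - x)       # choose 8 of the 45 good components
ways_total = comb(N, n)              # choose any 10 of the 50 components

probability = ways_defective * ways_good / ways_total

print(ways_defective, ways_good, ways_total)  # 10 215553195 10272278170
print(f"{probability:.4f}")                   # 0.2098
```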
Real-World Applications
The hypergeometric distribution is not just a theoretical exercise; it is used across various industries to solve complex problems:
- Quality Control and Auditing: Companies use this formula to determine the likelihood of finding errors or defects in a batch of products without having to test the entire inventory.
- Ecology and Wildlife Management: The "Capture-Recapture" method uses hypergeometric principles to estimate the total population of a species in the wild.
- Gaming and Card Games: In games like Texas Hold'em or Bridge, the hypergeometric distribution helps players calculate the probability of being dealt a specific hand or hitting a "draw" since cards are not returned to the deck after being dealt.
- Election Auditing: To verify the integrity of an election, auditors use this distribution to sample precincts and check if the observed discrepancies suggest a wider issue in the total population.
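As one illustration of the card-game case, the sketch below estimates the chance of completing a flush in Texas Hold'em. It assumes a hypothetical situation where a player sees four hearts after the flop, leaving 9 hearts among the 47 unseen cards with 2 cards still to come; `p_at_least_one` is an illustrative helper name.

```python
from math import comb

def p_at_least_one(N: int, K: int, n: int) -> float:
    """P(X >= 1) = 1 - P(X = 0) for n draws without replacement."""
    # P(X = 0): all n cards come from the N - K "failures".
    return 1 - comb(N - K, n) / comb(N, n)

p_flush = p_at_least_one(N=47, K=9, n=2)
print(f"{p_flush:.4f}")  # 0.3497
```

Working through the complement, $P(X = 0)$, is often simpler than summing the probabilities of one and two successes separately.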
Properties of the Hypergeometric Distribution
Beyond the probability mass function (PMF), it is often useful to calculate the mean and variance of the distribution.
Mean (Expected Value)
The mean represents the average number of successes we expect to see in the sample:
$$E[X] = n \frac{K}{N}$$
Variance
The variance measures the spread of the possible outcomes:
$$\operatorname{Var}(X) = n \frac{K}{N} \left( \frac{N-K}{N} \right) \left( \frac{N-n}{N-1} \right)$$
The term $\frac{N-n}{N-1}$ is known as the finite population correction factor, which accounts for the reduction in variance due to sampling from a finite group without replacement.
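The closed-form mean and variance can be cross-checked against values computed directly from the PMF. The sketch below does this for the quality control example ($N = 50$, $K = 5$, $n = 10$) and assumes $N > 1$ so the correction factor is defined; the function names are illustrative.

```python
from math import comb

def hypergeom_pmf(x: int, N: int, K: int, n: int) -> float:
    if x < max(0, n - (N - K)) or x > min(n, K):
        return 0.0
    return comb(K, x) * comb(N - K, n - x) / comb(N, n)

def mean_and_variance(N: int, K: int, n: int) -> tuple[float, float]:
    """Closed-form mean and variance (requires N > 1)."""
    mean = n * K / N
    variance = mean * ((N - K) / N) * ((N - n) / (N - 1))
    return mean, variance

def moments_from_pmf(N: int, K: int, n: int) -> tuple[float, float]:
    """Mean and variance computed by summing over the PMF."""
    xs = range(max(0, n - (N - K)), min(n, K) + 1)
    mean = sum(x * hypergeom_pmf(x, N, K, n) for x in xs)
    variance = sum((x - mean) ** 2 * hypergeom_pmf(x, N, K, n) for x in xs)
    return mean, variance

print(mean_and_variance(50, 5, 10))
print(moments_from_pmf(50, 5, 10))
```

Both calls should agree up to floating-point rounding: a mean of 1 and a variance of $36/49 \approx 0.7347$.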
Summary of Key Points
The hypergeometric distribution is the standard model for calculating probabilities in finite populations where sampling occurs without replacement. By understanding the relationship between the total population, the number of successes, and the sample size, you can compute exact outcome probabilities in fields ranging from manufacturing to ecological science. Remember that as the population size grows relative to the sample, the hypergeometric distribution approaches the binomial distribution, providing a bridge between dependent and independent probability models.
Frequently Asked Questions (FAQ)
What is the difference between the hypergeometric distribution and the hypergeometric test?
The hypergeometric distribution is the underlying mathematical model that describes the probabilities of outcomes. The hypergeometric test is a statistical test that uses this distribution to determine if a sub-population is significantly over-represented or under-represented in a sample (often used in gene set enrichment analysis).
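A one-sided test for over-representation is commonly computed as the upper-tail probability $P(X \ge x)$. The sketch below is one simple way to do that, not a full statistical testing routine; the function name `upper_tail_p_value` and the gene counts in the example are hypothetical.

```python
from math import comb

def upper_tail_p_value(x: int, N: int, K: int, n: int) -> float:
    """P(X >= x): chance of seeing x or more successes by random sampling."""
    lo = max(x, 0, n - (N - K))   # smallest outcome to include
    hi = min(n, K)                # largest possible outcome
    total = comb(N, n)
    return sum(comb(K, k) * comb(N - K, n - k) for k in range(lo, hi + 1)) / total

# Hypothetical enrichment question: 20,000 genes, 200 in a pathway,
# 100 genes selected by an experiment, 6 of which are in the pathway.
p_value = upper_tail_p_value(6, N=20_000, K=200, n=100)
print(f"{p_value:.6f}")
```

A small value suggests the pathway appears in the selection more often than random sampling would explain; how small is "small enough" depends on the chosen significance level.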
Can $x$ be greater than $K$ or $n$?
No. The number of successes in your sample ($x$) cannot exceed the number of successes in the population ($K$) or the sample size ($n$), so $x \le \min(n, K)$. There is also a lower bound: the sample cannot contain more failures than the population holds, so $x \ge \max(0, n - (N - K))$. Any value of $x$ outside this range has probability zero.
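The range of possible values for $x$ can be computed directly: it runs from $\max(0, n - (N - K))$, since the sample cannot hold more failures than exist, up to $\min(n, K)$. The helper name `support` below is an illustrative choice.

```python
def support(N: int, K: int, n: int) -> tuple[int, int]:
    """Return the smallest and largest possible number of successes."""
    lowest = max(0, n - (N - K))   # forced successes once failures run out
    highest = min(n, K)            # limited by sample size and by K
    return lowest, highest

print(support(N=50, K=5, n=10))  # (0, 5)
print(support(N=10, K=7, n=5))   # (2, 5): only 3 failures exist
```

In the second case, drawing 5 items from a population with only 3 failures guarantees at least 2 successes.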
Why is "without replacement" so important?
In "without replacement" sampling, each draw changes the probability of the next. For example, if you draw a red ball from a bag, there is one fewer red ball for the next draw. This dependency is what the hypergeometric formula captures, which the binomial formula ignores.
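A short simulation can show the effect of replacement. The sketch below uses hypothetical settings (50 items, 5 successes, samples of 10) and estimates the chance of exactly 2 successes under both sampling schemes with Python's `random` module; results are estimates and vary with the seed and the number of trials.

```python
import random

def estimate_p_exactly(x: int, N: int, K: int, n: int, replace: bool,
                       trials: int = 20_000, seed: int = 0) -> float:
    """Monte Carlo estimate of P(exactly x successes in n draws)."""
    rng = random.Random(seed)               # local generator for repeatability
    population = [1] * K + [0] * (N - K)    # 1 = success, 0 = failure
    hits = 0
    for _ in range(trials):
        if replace:
            sample = rng.choices(population, k=n)  # with replacement
        else:
            sample = rng.sample(population, n)     # without replacement
        if sum(sample) == x:
            hits += 1
    return hits / trials

without_replacement = estimate_p_exactly(2, 50, 5, 10, replace=False)
with_replacement = estimate_p_exactly(2, 50, 5, 10, replace=True)
print(without_replacement, with_replacement)
```

The first estimate should land near the hypergeometric value of about 0.2098, and the second near the binomial value of about 0.1937.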
What are the parameters of the hypergeometric distribution?
The three parameters are the population size ($N$), the number of success states in the population ($K$), and the number of draws ($n$). The variable $x$ is the random variable representing the observed successes.