Sampling Methods and Algorithms in Data Analysis: Probability Sampling

Sampling is a fundamental process in data analysis. It involves selecting a subset of individuals from a larger population to study. By analyzing this sample, researchers can draw reliable conclusions about the structural and statistical properties of the entire population. In this article, we will look into the classification of sampling techniques, and discuss the first group — probability sampling — in more detail.

 

Originally developed within the field of statistics, sampling has evolved to play a crucial role in machine learning, where it creates training, testing, and validation datasets for model development.

Key Challenges in Sampling

The accuracy and significance of insights derived from data analysis depend heavily on the effectiveness of the sampling method. A poorly designed sample can lead to biased results and inaccurate conclusions. Bias is a major challenge in sampling. It occurs when the sample doesn't accurately represent the population. This can lead to sampling error, which is the difference between the sample estimates and the true population values.

To mitigate the impact of bias and sampling error, statisticians employ various techniques and statistical tests to assess the significance of differences between sample estimates. By doing so, they can determine whether the findings from the sample can be generalized to the entire population with a certain level of confidence.

In statistics, a population is a complete set of objects (or data points) within a specific area of interest. Analyzing the entire population to discover its patterns can be impractical or impossible due to various reasons, including:

  • The population can contain a vast number of objects, making it impractical to analyze all of them due to excessive time, resource, and computational demands.
  • Collecting data on every individual in the population is often prohibitively expensive or infeasible, as illustrated by the challenge of surveying an entire city's population.
  • Analyzing the entire population is unnecessary, as a reliable model can be constructed from a representative sample.
  • The primary focus of the analysis may be a specific subset of the population that meets particular criteria, such as clients over 40 years old.

However, if the population is small and accessible (e.g., employees or clients of a small company), it may be feasible and even necessary to use the entire population.

Many machine learning models don't rely on assumptions about the data's distribution. However, they still use sampling, and the learning process is often based on training samples. These samples must accurately reflect the patterns in the data to ensure the model generalizes well to the entire population.

While sampling in statistics and machine learning may differ slightly, both require samples that are representative and complete. Representativeness means the sample accurately reflects the population's characteristics, while completeness ensures the sample is large enough to capture the necessary information.

It's important to note that representativeness and completeness are not always correlated. A large sample can be unrepresentative, and a small sample can be representative. Carefully considering sampling methods is crucial to obtain high-quality samples for analysis.

The Main Stages of Sample Construction

There is no universal sequence for implementing the sampling process. However, the following steps are typically involved:

  1. Define the target population: Specify its members (e.g., people, households, companies, products, etc.), relevant characteristics, and geographical or temporal boundaries.
  2. Create a sampling frame: Determine the subset of the population from which the sample will actually be drawn, e.g., people older than 18, or customers with above-average income. In simple cases, the entire population can serve as the frame. A sampling frame can be defined with or without replacement, specifying whether each unit can be used more than once.
  3. Select a sampling method and algorithm: Consider data type, sample size, and desired precision.
  4. Determine the sample size based on desired precision and analysis complexity: In statistical research, the sample size must be large enough to ensure accurate estimation of population parameters. Similarly, in machine learning, the sample size should be sufficient for the model to learn and generalize well to unseen data. The specific requirements vary depending on the model used. For instance, the number of data points in a neural network training set must typically exceed the number of connections between nodes to prevent overfitting.
  5. Implement the sampling process: Consider factors like data source, volume, and network constraints.
  6. If necessary, gather additional data on the sampled objects: such as demographics or preferences.

Classification of Sampling Methods

A typical classification of sampling methods is presented in Fig. 2.

All sampling methods are categorized into two primary groups:

Nonprobability methods select samples based on specific rules or criteria. For example, a rule might specify, 'Select all men aged 30 to 40 years.' In this case, every individual fitting this description would be included in the sample.

Probability methods, on the other hand, assign a specific probability to each item's inclusion in the sample. This probability is defined by the sampling algorithm and determines the likelihood of an item being chosen.

In this article, we will focus on probability sampling.

Probability (Random) Sampling

Probability sampling is a technique in which each element in the population has a known, non-zero chance of being selected. This ensures a representative sample, making it a valuable tool in statistical analysis. Here are the different varieties of probability sampling.

Simple Random Sampling (SRS)

SRS is a method where each individual element in a population has an equal chance of being selected. Items are chosen randomly and independently. This is often done by assigning a unique number to each element, generating random numbers, and then selecting the elements corresponding to these random numbers.

SRS is the simplest of probability sampling methods. It can also form a part of a more complex sampling scheme.

SRS can be conducted with or without replacement.

  • With replacement (WR): An element can be selected multiple times during the sampling process.
  • Without replacement (WOR): An element can only be selected once.

WR sampling is particularly useful when the available population size is insufficient to create a training set of the desired size.

WR sampling can sometimes result in duplicated items in a sample. However, if the initial population is large enough, the probability of selecting duplicates from a WR sampling frame is very low.
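
As an illustration, here is a minimal sketch of both variants using Python's standard random module (the population and sample size are invented for the example):

    import random

    population = list(range(1, 1001))   # hypothetical population of 1000 numbered units
    n = 10                              # desired sample size

    # Without replacement (WOR): each unit can appear at most once
    sample_wor = random.sample(population, n)

    # With replacement (WR): the same unit may be drawn more than once
    sample_wr = random.choices(population, k=n)

    print(sample_wor)
    print(sample_wr)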

SRS is advantageous because it requires minimal prior knowledge of the population and its results are straightforward to interpret.

Here are some common algorithms for SRS:

  • The naive algorithm: Each of the N units in the population is assigned an equal selection probability of 1/N. The algorithm then performs n random selection operations, where n is the desired sample size.
  • The random sort algorithm: Assigns each item a random key drawn from the uniform distribution on (0, 1), sorts all items by this key, and selects the n items with the smallest keys.
  • Reservoir Sampling (R-sampling): A technique for creating a simple random sample without replacement from a set of elements whose size is unknown or too large to fit entirely in memory. The algorithm processes the elements sequentially, selecting a fixed number, k, of them to form a reservoir.

How it works

  • Initialization: The first k elements are directly added to the reservoir.
  • Iterative Selection: Each subsequent element is assigned an index number i>k. Then the algorithm generates a random integer j between 1 and i.
  • If j is less than or equal to k, the current element i replaces the element at position j in the reservoir.
  • Otherwise, the current element is discarded.

Key Points

  • Probability of Selection: While element i is being processed, it enters the reservoir with probability k/i; once all n elements of the stream have been processed, every element has the same probability k/n of being in the final sample.
  • Memory Efficiency: The algorithm requires a fixed amount of memory, regardless of the input size.
  • Time Complexity: The basic algorithm makes a single pass and generates one random number for every element after the first k, which can be slow for very large streams; optimized variants skip ahead instead of examining every element.
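
To make the procedure concrete, here is a minimal Python sketch of the reservoir algorithm described above; the function name and the simulated stream are illustrative assumptions:

    import random

    def reservoir_sample(stream, k):
        """Return k items drawn uniformly at random from an iterable of unknown length."""
        reservoir = []
        for i, item in enumerate(stream, start=1):
            if i <= k:
                reservoir.append(item)       # fill the reservoir with the first k elements
            else:
                j = random.randint(1, i)     # random integer between 1 and i
                if j <= k:
                    reservoir[j - 1] = item  # replace the element at position j
        return reservoir

    # Example: sample 5 items from a simulated stream of 100,000 elements
    print(reservoir_sample(range(100_000), 5))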

Stratified Sampling

Stratified sampling is a technique in which the population is divided into smaller, more homogeneous subgroups, or strata.

To ensure accurate representation, the strata must be:

  • Collectively exhaustive: Every population element must belong to a stratum.
  • Mutually Exclusive: No element can belong to multiple strata.

Then, items are selected from each stratum, for example, by simple random sampling.

Two Key Stratification Techniques

Proportionate allocation: Samples are formed proportionally to the size of the corresponding groups (strata) in the population. For example, if a population consists of 60% men and 40% women, the final sample should also have 60% men and 40% women.

Disproportionate (optimum) allocation: The sampling fraction of each stratum is proportionate to both its population size (as above) and the standard deviation of the distribution of the variable. Larger samples are drawn from more variable strata to minimize overall sample variance.
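
A minimal sketch of proportionate allocation using only the Python standard library; the strata and their sizes are invented for the example, and disproportionate allocation would additionally weight each stratum's share by its variability:

    import random

    # Hypothetical population split into two strata
    strata = {
        "men":   list(range(600)),   # 60% of the population
        "women": list(range(400)),   # 40% of the population
    }

    total = sum(len(units) for units in strata.values())
    sample_size = 100

    # Proportionate allocation: each stratum contributes in proportion to its size
    sample = []
    for name, units in strata.items():
        n_h = round(sample_size * len(units) / total)
        sample.extend(random.sample(units, n_h))

    print(len(sample))  # 100 items: 60 men + 40 women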

Conditions for Effective Stratified Sampling

  • Within-strata variance is minimal.
  • Between-strata variance is maximal.
  • Stratification variables are highly correlated with the outcome variable.

Advantages of Stratified Sampling

  • Focus on important groups: Allows for targeted sampling of specific groups.
  • Flexible sampling methods: Different sampling techniques can be applied to different strata.
  • Reduced sampling error: Improved precision of population estimates derived from the sample.
  • Manageable sampling process: Simplifies the sampling process.

Disadvantages of Stratified Sampling

  • Subjective stratification: Choosing appropriate stratification variables can be challenging.
  • Ineffective for homogeneous populations: Not suitable for populations without distinct subgroups, or when the groups cannot be exhaustive.
  • Yule–Simpson Paradox: Trends observed within strata may disappear or reverse when combined.
  • Requires prior knowledge: Information about the population's characteristics (e.g., demographics or income) is necessary.

By carefully considering these factors, researchers can effectively use stratified sampling to enhance the quality and reliability of their studies.

Probability Proportional to Size Sampling (PPS)

Sometimes, the original population contains a variable that correlates with the feature of interest for grouping. This variable, known as an auxiliary variable or size measure, can enhance sample accuracy. It essentially determines the significance of each object, which, in turn, influences its probability of inclusion in the sample.

Example

Consider a marketing survey where you aim to sample customers. If you anticipate that survey results might correlate with the age of respondents, age can serve as a size measure.

To implement PPS sampling:

  1. Categorize the population: Divide customers into age categories, such as 21-30, 31-40, 41-50, and 51-60.
  2. Determine category sizes: Let's say that the categories in our example contain 500, 700, 1000, and 300 customers, respectively (totalling 2500 customers).
  3. Calculate selection probabilities: The probability of selecting a customer from a specific category, P_i, is calculated as:

    P_i = \frac{N_i}{N}

Where:

  • P_i: probability of selection for the i-th category
  • N_i: number of customers in the i-th category
  • N: total number of customers

In our example, the probability of selecting a customer from the second age category (age 31–40, 700 customers) is:

    P_2 = \frac{700}{2500} = 0.28

By employing PPS sampling, we can:

  • Target the sample: Focus on items that are more likely to impact the analysis results.
  • Increase sample accuracy: Obtain a more representative sample by prioritizing significant objects.

This technique is beneficial when dealing with heterogeneous populations where certain groups have a disproportionate influence on the overall outcome.
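
In code, probability proportional to size can be expressed as weighted random selection. Here is a minimal sketch using Python's random.choices with the category sizes from the example above; drawing with replacement is a simplifying assumption:

    import random

    categories = ["21-30", "31-40", "41-50", "51-60"]
    sizes      = [500, 700, 1000, 300]    # N_i for each category; N = 2500

    # Each draw picks a category with probability P_i = N_i / N
    selected = random.choices(categories, weights=sizes, k=10)
    print(selected)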

Cluster Sampling

This is a probability sampling technique where the population is divided into distinct groups known as clusters. These clusters can be based on various criteria, such as geographical location (e.g. cities, states, countries) or specific categories (e.g. industries, product types). Unlike data clustering techniques, these clusters are not formed based on similarity but rather on predefined categories or natural groupings, and can contain items with significant feature variations. For instance, a city cluster might include residents of diverse ages and income levels.

To be effective, cluster sampling requires that:

  • Items in each cluster have equal chances of being selected.
  • Clusters are mutually exclusive (each item belongs to only one cluster).
  • Clusters are exhaustive (each item is in some cluster).

After forming such clusters, a number of them are selected randomly to represent the total population. Then, all items within the selected clusters are included in the final sample (no items from non-selected clusters are included). This makes cluster sampling different from stratified sampling, where some elements are sampled from each stratum. This type of sampling is more precisely called single-stage cluster sampling.
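
A minimal sketch of single-stage cluster sampling; the clusters and their contents are invented for the example:

    import random

    # Hypothetical clusters, e.g. cities and their residents
    clusters = {
        "city_A": ["a1", "a2", "a3"],
        "city_B": ["b1", "b2"],
        "city_C": ["c1", "c2", "c3", "c4"],
        "city_D": ["d1", "d2", "d3"],
    }

    # Stage 1: randomly select some of the clusters ...
    chosen = random.sample(list(clusters), 2)

    # ... and include every item from the selected clusters in the sample
    sample = [item for name in chosen for item in clusters[name]]
    print(chosen, sample)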

Multistage Sampling

Multistage sampling can be viewed as a more complex form of cluster sampling, as it also requires dividing the population into clusters. However, assessing all elements of each selected cluster may be overly expensive or unnecessary. Instead, the researcher can randomly select a proportional or fixed number of items from each cluster — this would be the second stage of multistage sampling.

There may be more stages of sampling, where clusters are selected at each stage. For instance, a researcher might randomly select countries, then provinces in them, then cities within provinces, and finally, some of the households in each city. This approach aims to reduce study time and costs by filtering out unnecessary observations.
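
A minimal two-stage sketch, reusing the same hypothetical clusters as above: instead of taking whole clusters, a fixed number of items (an arbitrary quota) is drawn from each selected cluster.

    import random

    clusters = {
        "city_A": ["a1", "a2", "a3"],
        "city_B": ["b1", "b2"],
        "city_C": ["c1", "c2", "c3", "c4"],
        "city_D": ["d1", "d2", "d3"],
    }

    # Stage 1: randomly select clusters
    chosen = random.sample(list(clusters), 2)

    # Stage 2: draw a fixed number of items from each selected cluster
    quota = 2
    sample = [item
              for name in chosen
              for item in random.sample(clusters[name], min(quota, len(clusters[name])))]
    print(chosen, sample)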

Cluster sampling is particularly useful for reducing costs in geographically distributed studies, which is why it's often referred to as geographic, territorial, or regional sampling.

Systematic Sampling

Systematic, or interval, sampling involves selecting elements from a list or sequence at a regular interval. Commonly, all items are assigned equal selection probability.

How it works

  • Sampling Interval: Suppose the sampling frame has N units, and you need a sample of n units. The sampling interval k is calculated by formula k = N/n.
  • Random Start: A random number, h, between 1 and k, is chosen as the starting point.
  • Repeat: Every k-th element is selected from the sampling frame until the desired sample size n is reached.

Example: If you want to select 100 people from a population of 1000, you would choose a random starting point between 1 and 10. Then, you would select every 10th person, counting from the starting point.
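
A minimal sketch of this interval procedure; the population here is simply a numbered list invented for the example:

    import random

    population = list(range(1, 1001))   # N = 1000 hypothetical units
    n = 100                             # desired sample size

    k = len(population) // n            # sampling interval: 1000 / 100 = 10
    start = random.randint(1, k)        # random start between 1 and k

    # Select every k-th element, counting from the random start (positions are 1-based)
    sample = [population[i - 1] for i in range(start, len(population) + 1, k)]
    print(start, len(sample))           # prints the starting point and 100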

Important Considerations

  • Homogeneity: Systematic sampling works best when the population is homogeneous, meaning it's relatively uniform.
  • Cyclic Patterns: If the population has cyclic patterns that align with the interval k, the sample may not be representative.
  • Randomization: To mitigate this risk, it's often beneficial to randomly shuffle the order of the population list before applying systematic sampling.

By following these guidelines, systematic sampling can be a reliable and efficient method for selecting representative samples.

See also

Sampling Methods and Algorithms in Data Analysis: Nonprobability Sampling