Outline
Characterizing Data
- There are four ways of characterizing the data required by a model:
- Fit a distribution to system measurements.
- Sample a histogram of system measurements.
- Guess a distribution based on system characteristics.
- Use the system measurements as trace data.
Fitting Distributions
- Given some sampled data, find a distribution from which it might have
come.
- Then use the distribution to generate more data.
- The procedure is
- Create a frequency histogram from the sample data.
- Pick a candidate distribution whose shape resembles the frequency histogram.
- Calculate the mean and variance of the sample data.
- Calculate the mean and variance of the distribution.
- Assess goodness of fit by comparing the data statistics with the distribution statistics.
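The procedure above can be sketched in Python roughly as follows. This is only an illustration under stated assumptions: the exponential distribution is an arbitrary choice of candidate, the function name `fit_exponential` and the bin count are hypothetical, and the variance ratio is a crude stand-in for a formal goodness-of-fit test.

```python
import random
import statistics

def fit_exponential(sample, bins=10):
    """Fit an exponential candidate to sample data by matching the mean."""
    # Step 1: frequency histogram over equal-width bins.
    lo, hi = min(sample), max(sample)
    width = (hi - lo) / bins or 1.0          # avoid a zero width for constant data
    counts = [0] * bins
    for x in sample:
        # Clamp so the maximum value lands in the last bin.
        counts[min(int((x - lo) / width), bins - 1)] += 1

    # Step 3: mean and variance of the sample data.
    data_mean = statistics.fmean(sample)
    data_var = statistics.variance(sample)

    # Steps 2 and 4: the assumed candidate is exponential with rate 1/mean;
    # its theoretical mean is 1/rate and its variance is 1/rate**2.
    rate = 1.0 / data_mean
    dist_mean = 1.0 / rate
    dist_var = 1.0 / rate ** 2

    # Step 5: the means agree by construction, so compare the variances.
    # A ratio near 1 suggests the candidate is plausible.
    return {
        "counts": counts,
        "rate": rate,
        "data_mean": data_mean,
        "dist_mean": dist_mean,
        "variance_ratio": data_var / dist_var,
    }

# Usage with synthetic data standing in for system measurements.
random.seed(1)
sample = [random.expovariate(0.5) for _ in range(5000)]
fit = fit_exponential(sample)
print(fit["rate"], fit["variance_ratio"])
```

Once a candidate passes the comparison, more data can be generated from it, here with `random.expovariate(fit["rate"])`.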
Histogram Sampling
- Sometimes only the histograms of system measurements are available.
- There are two approaches in this case:
- Start from step 2 of distribution fitting (pick a distribution close to the histogram).
- Use the histogram to derive an empirical distribution.
Characteristic-Based Selection
- Sometimes no measurements are available, or possible.
- Previous experience, analogous systems, or theoretical results can
provide intuition as to what to do.
- In the worst case, use distributions that match the characteristics of
the data of interest.
- Model component lifetimes with the Weibull distribution.
- Model retransmissions before success with the negative binomial
distribution.
- Model service times with the Erlang distribution.
- Model random proportions with the beta distribution.
- But you still need parameters for the distributions.
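As a rough illustration of the point about parameters, the sketch below draws one value from each of the four distributions using Python's standard `random` module. Every parameter value and function name here is a placeholder assumption; in practice the parameters would have to come from experience, analogous systems, or theory.

```python
import random

rng = random.Random(42)  # fixed seed so runs are repeatable

def component_lifetime(scale=1000.0, shape=1.5):
    """Weibull lifetime; scale and shape are made-up values."""
    return rng.weibullvariate(scale, shape)

def retransmissions_before_success(p=0.8, successes=1):
    """Negative binomial: failed attempts before the required successes.

    p is the assumed per-attempt success probability.
    """
    failures = 0
    done = 0
    while done < successes:
        if rng.random() < p:
            done += 1
        else:
            failures += 1
    return failures

def service_time(k=3, rate=2.0):
    """Erlang(k, rate), drawn as a gamma with integer shape k and scale 1/rate."""
    return rng.gammavariate(k, 1.0 / rate)

def random_proportion(alpha=2.0, beta=5.0):
    """Beta-distributed proportion in [0, 1]."""
    return rng.betavariate(alpha, beta)

print(component_lifetime(), retransmissions_before_success(),
      service_time(), random_proportion())
```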
Historic Data
- Use prior data sets.
- Directly as inputs.
- Indirectly via random sampling.
- Problems:
- Simulation requirements may outstrip the available data.
- Biased, dirty, skewed, or otherwise compromised data.
- Limited ability to vary input data.
- Predictions are difficult to make without detailed knowledge of data
characteristics.
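The two ways of using prior data might look like the following sketch. The trace values are invented for illustration and the function names `replay` and `resample` are hypothetical.

```python
import itertools
import random

def replay(trace):
    """Direct use: yield the recorded values in their original order."""
    yield from trace

def resample(trace, rng=None):
    """Indirect use: yield values drawn at random, with replacement."""
    rng = rng or random.Random()
    while True:
        yield rng.choice(trace)

trace = [12.0, 7.5, 30.2, 7.5, 18.9]  # made-up measurements

direct = list(replay(trace))  # stops after five values
indirect = list(itertools.islice(resample(trace, random.Random(0)), 12))
print(direct)
print(indirect)
```

Replay is exhausted once the trace ends, which is the first problem listed above. Resampling never runs out, but it can only repeat values already in the trace and it discards any ordering among them.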
Empirical Distributions
- An empirical distribution is a distribution created from a
specific data set.
- General approach:
- Create frequency counts for the data.
- Form the cumulative relative frequencies of the data.
- Generate samples from the cumulative relative frequencies.
Frequency Counts
- A frequency count is the number of data points falling in a particular
value range.
- Given a data set { x1, ..., xn }, create frequency counts
by
- Creating a set of value ranges { r1, ..., rm }; the
range set should
- be contiguous (ri.high == r(i+1).low).
- cover all possible values
- provide enough samples per value range
- For each range ri, count the values xj such that ri.low <= xj < ri.high.
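A minimal sketch of the counting step, assuming the contiguous ranges are given as a sorted list of boundaries so that range i is edges[i] <= x < edges[i+1]. The name `frequency_counts` and the sample data are illustrative only.

```python
from bisect import bisect_right

def frequency_counts(data, edges):
    """Count the data points in each half-open range [edges[i], edges[i+1])."""
    counts = [0] * (len(edges) - 1)
    for x in data:
        # The ranges must cover every value, so reject anything outside them.
        if not edges[0] <= x < edges[-1]:
            raise ValueError(f"{x} is outside the covered ranges")
        # bisect_right finds the first boundary greater than x;
        # the range holding x starts one boundary earlier.
        counts[bisect_right(edges, x) - 1] += 1
    return counts

data = [1, 2, 2, 3, 5, 7, 8, 9]
edges = [0, 3, 6, 10]          # ranges [0,3), [3,6), [6,10)
print(frequency_counts(data, edges))  # [3, 2, 3]
```

Sharing boundaries between neighbouring ranges makes the ranges contiguous by construction.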
Cumulative Relative Frequencies
- The cumulative relative frequency of a value range is the fraction of
the data falling in that range or in any range below it.
- Given a set of frequency counts { f1, ..., fm }, one per value range,
form the cumulative relative frequencies by
- Finding the normalization value S = sum(i = 1 to m, fi).
- Forming, for each i, the cumulative relative frequency
ci = sum(j = 1 to i, fj)/S.
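The two steps can be sketched as below. The function name is illustrative, and the counts are a small made-up example.

```python
from itertools import accumulate

def cumulative_relative_frequencies(counts):
    """Turn frequency counts into running fractions of the total."""
    total = sum(counts)                      # the normalization value S
    if total == 0:
        raise ValueError("no data to normalize")
    # accumulate yields the running sums f1, f1+f2, ...; divide each by S.
    return [running / total for running in accumulate(counts)]

print(cumulative_relative_frequencies([3, 2, 3]))  # [0.375, 0.625, 1.0]
```

The result is nondecreasing and its last entry is always 1, since the final running sum equals S.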
Data Generation
- Given a set of cumulative relative frequencies, generate a data point
from the distribution as follows:
- Generate a sample x from the standard uniform distribution.
- Locate the first (leftmost) cumulative relative frequency ci such that
x <= ci.
- Generate a sample from the data distribution by either
- picking an endpoint of the value range for the located cumulative relative frequency or
- interpolating between the endpoints of that value range.
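A sketch of the generation step, assuming the value ranges are stored as a sorted boundary list `edges` and the cumulative relative frequencies as a list `cum` ending in exactly 1.0. These names, and the choice of the upper endpoint when not interpolating, are assumptions made for illustration.

```python
import random
from bisect import bisect_left

def sample_empirical(edges, cum, rng=random, interpolate=True):
    """Draw one value from the empirical distribution described by edges and cum."""
    # Uniform draw shifted to (0, 1] so that empty ranges are never selected.
    x = 1.0 - rng.random()
    # First index i with x <= cum[i].
    i = bisect_left(cum, x)
    low, high = edges[i], edges[i + 1]
    if not interpolate:
        return high                      # endpoint choice: upper end of the range
    # Linear interpolation: how far x lies between cum[i-1] and cum[i]
    # decides how far the sample lies between low and high.
    prev = cum[i - 1] if i > 0 else 0.0
    return low + (high - low) * (x - prev) / (cum[i] - prev)

edges = [0, 3, 6, 10]
cum = [0.375, 0.625, 1.0]
rng = random.Random(0)
print([round(sample_empirical(edges, cum, rng), 2) for _ in range(5)])
```

Endpoint selection can only return range boundaries, whereas interpolation spreads samples evenly inside each range.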
Points to Remember
This page last modified on 24 February 2005.