Calculating Sample Size

Calculator inputs (example values):

% confidence level: 50, 68, 90, 95, or 99

% confidence interval (e): 0.26

% accuracy: 0.795

data size (population): 486000000

calculated sample size: 10

A common rule of thumb holds that a 10% sample of a population is enough to estimate survey results for the whole population. With a huge dataset, such as 1 billion records, even 10% is still large, so you can instead look for the optimal (minimum) amount of data to survey.

This standard equation defines the appropriate sample size ($SS$) of people to use for a survey:

$$
SS = \frac{\dfrac{Z^2 P(1-P)}{e^2}}{1 + \dfrac{Z^2 P(1-P)}{e^2 N}}
$$
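As a rough illustration, the equation can be sketched in Python. The function name `sample_size` and its argument names are hypothetical and not part of the original calculator. The example call uses the input values shown in the calculator at the top of this page and assumes a 95% confidence level (Z of about 1.96) with the result rounded up to a whole record; the page does not state which confidence level or rounding rule produced its result.

```python
import math


def sample_size(z: float, p: float, e: float, n: int) -> float:
    """Sample size SS for z-score z, accuracy p, error tolerance e, population n."""
    if e == 0 or n <= 0:
        raise ValueError("e must be non-zero and n must be positive")
    # Sample size for an unlimited population: Z^2 * P * (1 - P) / e^2
    unbounded = (z ** 2) * p * (1 - p) / (e ** 2)
    # Finite-population correction: divide by 1 + unbounded / N
    return unbounded / (1 + unbounded / n)


# Example inputs from the calculator at the top of this page.
# Assumption: 95% confidence (z ~ 1.96) and rounding up to a whole record.
ss = sample_size(z=1.96, p=0.795, e=0.26, n=486_000_000)
print(math.ceil(ss))  # prints 10
```

Under these assumptions the sketch gives about 9.26, which rounds up to the calculated sample size of 10 shown in the calculator. With a population this large the correction term in the denominator is negligible; it only matters when the population is small relative to the uncorrected sample size.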

This equation is commonly applied to big data projects, with the number of records taken as the population size, to determine how much data should be analyzed.

The parameters to define the sample size are:

Confidence level $Z$: the precision required for the survey, entered in the equation as the z-score for the chosen confidence level (for example, about 1.96 for 95%)

Confidence interval $e$: the error tolerance for the survey, $-0.04 \le e \le 0.4$

Accuracy $P$: the data quality or trustworthiness of the information in the data

Data size $N$: the total population (or number of records in the database)
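The confidence level is chosen as a percentage, while the equation expects a z-score for Z. A minimal sketch of that conversion is shown here, assuming a two-sided interval on a standard normal distribution; the page does not say how its calculator performs this step, and the helper name `z_score` is hypothetical. It uses Python's standard-library `statistics.NormalDist`.

```python
from statistics import NormalDist


def z_score(confidence_percent: float) -> float:
    """Two-sided z-score for a confidence level given in percent (0 < level < 100)."""
    level = confidence_percent / 100
    # A two-sided interval leaves (1 - level) / 2 in each tail of the normal curve.
    return NormalDist().inv_cdf((1 + level) / 2)


# The confidence levels offered by the calculator on this page.
for percent in (50, 68, 90, 95, 99):
    print(f"{percent}% -> Z = {z_score(percent):.3f}")
```

Higher confidence levels give larger Z values, and because Z is squared in the equation, the required sample size grows quickly as the confidence level rises.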