Python lab on statistical analysis

From QMC

Jump to: navigation, search

Contents

[edit] Getting Started

In order to use the software for the lab, several environment variables need to be set up. This can be done by typing

source /afs/ictp/public/shared/smr1929/login.cshrc

Note: the environment variables will only be set in the terminal window in which you type this command.

Verify that everything is working by typing

TestPlot

This should produce a simple plot of math.

Next, let's get the sample data and skeleton code into your home directory:

setup_stats

This will create a Statistics subdirectory in your current directory and copy some files into it. You can now cd Statistics and begin the tutorial.

[edit] Doing Statistics

To begin with, we will write some code that will do statistical analysis. For this section, we will work the the data set test.dat, which should be in your Statistics directory. Save this file in the same place you will be working on your python script. Once you have working code, we will use this code later to analyze results from your own variational monte carlo code.

In this section, we will use python. Start with the template in CalcStatistics.py. There are a number of blank functions in this file which you will fill in.

Let's start by figuring out how to take the average of our data. (Throughout this tutorial we will use the library numpy for arrays) It turns out that numpy has an average function that we can call.
data_avg = numpy.average(myArray)

If you haven't used python before, though, it's a good idea to get started by writing one yourself). Let's walk through what you need to know in order to do this.

First in order to define a function we specify:

def Mean(myArray):
    for i in myArray:
        print i 
    # this prints everything in the array.
    # let's change this to average things in a
    return 0

You should notice a couple of things. To begin with, a function is defined by using the "def" keyword. The paramaters to the function don't have types (in python you almost never need to specify the types of data). Also, white-space is important in python. You must indent things (typically 4 spaces) underneath a function and underneath a "for loop". The importance of white space is somewhat unnatural to begin with, but makes the code much more readable.

There are a variety of ways to "loop" in python. The one that is most like C++ and fortran is to do

for i in range(0,len(myArray)):
    print myArray[i]

Equivalently one can do

for i in myArray:
   print i

Other useful points:

  1. Notice, that you can access the i'th element of the array by doing myArray[i]
  2. If it's a 2d array, you can access the (i,j)'th element by doing myArray[i,j]
  3. If you divide and one of your numbers is an integer, then you will get an integer result (this unfortunate behavior will go away in the next version of python). To avoid this behavior, add 0.0 onto an integer before dividing.

Using this information, write a function that calculates and returns the average of your array. The file TestMean.py reads test.dat and passes the data to your Mean function. It also calls Numpy's built-in average function to compare. After you have filled in your Mean fuction, try running

./TestMean.py

to see if your function is correct.

[edit] Equilibration (transient)

Of course, in calculating the average, we have ignored the fact that some of the data exists before the system has equilibrated. Let's start by looking at a plot of our data. The script PlotData.py plots the first 500 points in our data set:

#!/bin/env python
import numpy
import pylab
 
d = pylab.load('test.dat')
pylab.plot(d[:500])
pylab.show()

pylab (aka matplotlib) is a MATLAB-like python library that is very well suited for plotting publication quality 2d plots. You should now see a plot of the first 500 points of the data (we don't plot it all because otherwise the transient will get washed out). If the data you were averaging was sufficiently short, you would notice a difference between were the average appears on the plot and the average you calculated earlier. This is because, the first part of the data is clearly not yet equilibrated. We want to throw out that part of the data. To get only a piece of an array in python, we use slice notation calling "myArray[start:end]" (this includes the point "start" but not the point "end"). If you don't include one of these (i.e. calling myArray[start:]) it will substitute the first (respectively the last+1) point for the missing parameter). Use this information, to calculate the actual average without the equilibrated data. If you're feeling especially talented, come up with a heuristic way to take a dataset and return the value it has equilibrated at.

[edit] Standard Error

Now, we want to calculate the standard error of our dataset. In order to do this, we need to figure out the standard deviation of our dataset as well as the number of effective points. Let us start by calculating the standard deviation of our dataset. The standard deviation is defined as

math

where N is the number of points. Using the standard deviation we just calculated, the standard error can be expressed as

math

Note that we use math because the mean is determined from the same data as the variance.

Write functions

def StandardDev(myArray):
   # do stuff
   return standard_dev

and

def NaiveStandardError(myArray):
  # calculate the standard error by calling the standard deviation function
  return standard_err

that returns the standard deviation and "naive" standard error of your dataset. By this, I mean the standard error assuming that all points in your system were independent (i.e. math). As a check, the answer for this dataset should be 1.42209 FIX ME if you calculate it on the entire dataset (which is the incorrect thing to do since you should really leave out the non-equilibrated data). (Helpful note: sqrt or numpy.sqrt will give you the sqrt of a number)

Recall that the standard deviation gives, roughly speaking, the width of a normally distributed dataset. The command

./PlotHist.py

will plot a histogram of the dataset. Compare its width with your calculated value.

We would expect the standard error to be independent of the way in which samples are binned. However, we will see that this is not always the case with real data. If there is some correlation between different data points then the standard error turns out to be a function of the bin size. This is because math and math are different. Let us define kappa (math) to be math Here we will see two standard methods that we can use correct for this autocorrelation and get the correct standard error.

[edit] Binning

The binning method is a technique that let's you calculate the actual standard error by iteratively binning together sets of data. It works in the following way. We calculate the naive standard error of our dataset. Then, we "bin" the dataset by averaging together groups of k points producing a new dataset (size n/k). We then calculate the naive standard error for this new dataset. If we then plot the standard error as a function of k, we should find that the standard error eventually becomes independent of k. This asymptotic value is the real standard error. (From here you could obviously back out the value for math as well) There are two things to note about this procedure. To begin with, establishing the point at which the standard error "levels out" is typically done by looking at the plot (as opposed to automatically) although we could come up with an automatic procedure. The second thing to note is that for a large enough k, the average is over sufficiently few points that you are going to get nonsensical results. Obviously you should ignore the data at such large k.

To implement this procedure, start by writing a function

def BinData(myArray,i):
   #bin myArray into groups of size i and put the results in binnedArray
   return binnedArray

Note a couple useful things. To produce a new 1d python array (of floats) do

binnedArray = numpy.zeros((numPoints),float)

Now write a function

def BinningMethod(myArray):

that returns a list where the i'th element of the list is the naive standard error where myArray is binned into groups of i.

Then by doing

myBinnedList = BinningMethod(myArray)
pylab.plot(myBinnedList)

you can see a graph of the binning size versus the naive standard error. Look at this plot and estimate the "true error" that this asymptotes to (also back out math)

Again if you feel especially talented, write a function that uses this method to automatically find this asymptote (without actually having to graph and look at the plot)

[edit] Autocorrelation Time (optional)

Another way to get the correct standard error is to estimate math directly. This can be done by looking at the autocorrelation function

math

and integrating below it to get

math

where math is the point where C(i) goes negative.

Write a function

def c(i,myArray,mean,var):

that returns C(i). Then using pylab, write some code that graphs C(i) as a function of i. You should note that when this function goes to 0, the points are totally decorrelated (the value of math will be smaller then this because we can use partially correlated points to get a full "effective point".)

Now write a function

def CalcKappa(myArray):

that returns the autocorrelation time.

Now that you can calculate the standard deviation of your dataset and the autocorrelation time, math (and hence the number of effective points), you should be able to put these two things together to calculate the actual standard error in a function

def StandardError(myArray):

For the dataset we have supplied, you should get a result of 0.37427. Compare the results you get using this method with the results from the binning technique.

[edit] Putting it together

Now that we have python code to calculate the respective statistics, we will quickly package it together so that we can easily call it for the rest of our tutorial.

Write a function

def stats(myArray):
   return (mean,stdError)

that takes an array and returns a tuple of the mean and standard error. (In python a tuple is a set of values that are packaged together in paranthesis).

From another file (for example, the VMC one you are about to write), you would be able to call your function in the following way:

import CalcStatistics #because you saved it in CalcStatistics.py (should be at top of file)
(myMean,myError)=CalcStatistics.stats(myArray) #wherever you want to calculate statistics of your array