I found the time to read Gelman and Hill’s “Data Analysis Using Regression and Multilevel/Hierarchical Models”. Now, please do yourself a favour and get it (the paperback version, of course). Whether you are experienced or intermediate (like myself), this will be a treat for your eyes and neurons.
From the book website:
IPSUR stands for Introduction to Probability and Statistics Using R,
ISBN: 978-0-557-24979-4, which is a textbook written for an
undergraduate course in probability and statistics. The approximate
prerequisites are two or three semesters of calculus and some linear
algebra in a few places. Attendees of the class include mathematics,
engineering, and computer science majors.
Now, there is a new way to read R books (new to me, anyway!):
install.packages("IPSUR", repos="http://cran.r-project.org")
library(IPSUR)
read(IPSUR)
I was flicking through some papers I have printed over the last few years, and Breiman’s is definitely one of my favourites. It is a pragmatic and insightful read for a new statistician (or data analyst, if you prefer ;)).
As I left consulting to go back to the university, these were the perceptions I had about working with data to find answers to problems:
(a) Focus on finding a good solution—that’s what consultants get paid for.
(b) Live with the data before you plunge into modeling.
(c) Search for a model that gives a good solution, either algorithmic or data.
(d) Predictive accuracy on test sets is the criterion for how good the model is.
(e) Computers are an indispensable partner.
Leo Breiman (2001), Statistical modeling: The two cultures, Statistical Science, 16:199-231
Recently I needed to test a mean vector. I wrote a few lines of code in R and had it done perfectly. The Hotelling test is one of the least interesting tests to me; I never really figured out why…
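Those few lines are not reproduced here. As a rough idea of what they could look like, here is a minimal sketch of the classical one-sample Hotelling test in base R, assuming multivariate normal data. The function name `hotelling_one_sample` and the simulated data are made up for illustration; the statistic is the textbook $$ T^2 = n(\bar{x}-\mu_0)' S^{-1} (\bar{x}-\mu_0) $$, referred to an F distribution.

```r
# Classical one-sample Hotelling T^2 test (illustrative sketch, base R only).
# hotelling_one_sample is a hypothetical helper name, not a package function.
hotelling_one_sample <- function(X, mu0) {
  X <- as.matrix(X)
  n <- nrow(X)                           # sample size
  p <- ncol(X)                           # dimension of the mean vector
  d <- colMeans(X) - mu0                 # deviation of the sample mean from mu0
  S <- cov(X)                            # sample covariance matrix
  T2 <- n * drop(t(d) %*% solve(S, d))   # T^2 statistic
  Fstat <- (n - p) / (p * (n - 1)) * T2  # under H0, Fstat ~ F(p, n - p)
  pval <- pf(Fstat, p, n - p, lower.tail = FALSE)
  list(T2 = T2, F = Fstat, df = c(p, n - p), p.value = pval)
}

# Simulated example: 20 observations in 3 dimensions, true mean zero
set.seed(1)
X <- matrix(rnorm(60), ncol = 3)
res <- hotelling_one_sample(X, mu0 = c(0, 0, 0))
res$p.value
```

With a single column the statistic reduces to the square of the usual t statistic, so the p-value should match a two-sided `t.test`.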
Afterwards I had some time to look into it further. One of the first things I look for with any test is a robust version of it. A little digging, down to the third page of Google results, leads to the following:
One-sample and two-sample robust Hotelling tests with fast and robust bootstrap
The classical Hotelling test for testing if the mean equals a certain value or if two means are equal is modified into a robust one through substitution of the empirical estimates by the MM-estimates of location and scatter. The MM-estimator, using Tukey’s biweight function, is tuned by default to have a breakdown point of 50% and 95% location efficiency. This could be changed through the control argument if desired.
Robust Hotelling T2 test
Performs one and two sample Hotelling T2 tests as well as robust one-sample Hotelling T2 test.
The first uses MM- and S-estimators, while the latter uses a Minimum Covariance Determinant (MCD) estimator. You can find more on these in the links at the end of the post. What might be crucial to you is that the MM/S estimators are more time-consuming to compute than the MCD. A little demonstration follows…
There is a central notion in Time Series Econometrics: cointegration. Loosely, it refers to finding the long-run equilibrium of two non-stationary series. As the best-known examples of non-stationary series come from finance, cointegration is nowadays a tool for traders (not a common one, though!). They use it as the theory behind pairs trading (aka Statistical Arbitrage).
In the following lines we use a simple pairs trading technique, studying the ratio of the two price series. We use the New York listings of two of the biggest players on ATHEX, NBG and OTE.
stock <- "NBG"
stock1 <- "OTE"
start.date <- "2003-10-20"
end.date <- Sys.Date()
quote <- paste("http://ichart.finance.yahoo.com/table.csv?s=",
stock,
"&a=", substr(start.date,6,7),
"&b=", substr(start.date, 9, 10),
"&c=", substr(start.date, 1,4),
"&d=", substr(end.date,6,7),
"&e=", substr(end.date, 9, 10),
"&f=", substr(end.date, 1,4),
"&g=d&ignore=.csv", sep="")
quote1 <- paste("http://ichart.finance.yahoo.com/table.csv?s=",
stock1,
"&a=", substr(start.date,6,7),
"&b=", substr(start.date, 9, 10),
"&c=", substr(start.date, 1,4),
"&d=", substr(end.date,6,7),
"&e=", substr(end.date, 9, 10),
"&f=", substr(end.date, 1,4),
"&g=d&ignore=.csv", sep="")
dataNBG.l <- read.csv(quote, as.is=TRUE)
dataOTE.l <- read.csv(quote1, as.is=TRUE)
# sort both series by date, oldest first
X2 <- dataOTE.l[order(dataOTE.l$Date), ]
Y2 <- dataNBG.l[order(dataNBG.l$Date), ]
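To give an idea of what studying the ratio can look like once the two series are aligned by date, here is a small sketch. Everything in it is an assumption made for illustration: the prices are simulated (two series sharing a random-walk component) instead of being downloaded, and the entry threshold of two standard deviations is an arbitrary choice, not a recommendation.

```r
# Illustrative pairs-trading sketch on simulated prices (not real NBG/OTE data)
set.seed(42)
n <- 500
common <- cumsum(rnorm(n))                          # shared non-stationary component
p1 <- exp(3.0 + 0.05 * common + rnorm(n, sd = 0.02))  # hypothetical price series 1
p2 <- exp(2.5 + 0.05 * common + rnorm(n, sd = 0.02))  # hypothetical price series 2

ratio <- p1 / p2                                    # price ratio of the pair
z <- (ratio - mean(ratio)) / sd(ratio)              # standardised ratio

# Assumed rule: short the ratio when it is unusually high, long when unusually low
signal <- ifelse(z > 2, -1, ifelse(z < -2, 1, 0))
table(signal)

plot(z, type = "l", ylab = "standardised ratio")
abline(h = c(-2, 2), lty = 2, col = "red")
```

On real data the two data frames would first have to be matched on their common dates before taking the ratio.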
Inspired by an email that followed the previous post on random number generation, the following question arose:
How to draw random variates from the Von Mises distribution?
First of all, let’s look at the pdf of the distribution: $$ f(x)=\frac{e^{b\cos(x-a)}}{2\pi I_{0}(b)} $$ for $$ -\pi \leq x\leq \pi $$, where $$ I_{0} $$ is the modified Bessel function of the first kind of order zero.
OK, I admit that Bessel functions can be a bit frightening, but there is a workaround. The solution is a Metropolis algorithm simulation. There is no need to know the normalizing constant, because it cancels in the acceptance ratio. The following code is adapted from James Gentle’s notes on Mathematical Statistics.
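n <- 1000                  # number of draws
x <- rep(NA, n)
a <- 1                     # half-width of the uniform random-walk proposal
c <- 3                     # concentration (b in the pdf); the location is 0
yi <- 3                    # starting value of the chain
x[1] <- yi
for (i in 2:n) {
  # symmetric proposal: current state plus Uniform(-a, a) noise
  yip1 <- yi + a*(2*runif(1) - 1)
  # proposals outside (-pi, pi) have zero density and are rejected;
  # otherwise accept with probability min(1, f(yip1)/f(yi))
  if (yip1 < pi && yip1 > -pi &&
      exp(c*(cos(yip1) - cos(yi))) > runif(1)) yi <- yip1
  x[i] <- yi               # record the current state, accepted or not
}
hist(x, probability=TRUE, fg=gray(0.7), bty="7")
lines(density(x), col="red", lwd=2)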
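The Bessel function is only needed if we want the exact density, for instance to check the simulated values against it. A short sketch using base R's `besselI`; the function name `dvonmises` is made up here, and its defaults (location 0, concentration 3) simply mirror the values used in the simulation.

```r
# Exact von Mises density on (-pi, pi); dvonmises is a hypothetical helper name
dvonmises <- function(x, a = 0, b = 3) {
  exp(b * cos(x - a)) / (2 * pi * besselI(b, nu = 0))
}

# The density should integrate to 1 over (-pi, pi)
total <- integrate(dvonmises, lower = -pi, upper = pi)$value
total

# Plot the exact density on its own
curve(dvonmises(x), from = -pi, to = pi, ylab = "density", col = "steelblue", lwd = 2)
```

Calling `curve(dvonmises(x), add = TRUE)` right after a histogram of the simulated values would overlay the exact curve on it.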
There was a post here about obtaining non-standard p-values for testing the correlation coefficient. The R package SuppDists deals with this problem efficiently.
library(SuppDists)
plot(function(x)dPearson(x,N=23,rho=0.7),-1,1,ylim=c(0,10),ylab="density")
plot(function(x)dPearson(x,N=23,rho=0),-1,1,add=TRUE,col="steelblue")
plot(function(x)dPearson(x,N=23,rho=-.2),-1,1,add=TRUE,col="green")
plot(function(x)dPearson(x,N=23,rho=.9),-1,1,add=TRUE,col="red");grid()
# colours listed in the same order as the labels, so each matches its curve
legend("topleft", col=c("black","steelblue","green","red"),lty=1,
legend=c("rho=0.7","rho=0","rho=-.2","rho=.9"))
This is what it looks like:

Now, let’s construct a table of critical values for a few (more or less arbitrary) significance levels.
q=c(.025,.05,.075,.1,.15,.2)
xtabs(qPearson(p=q, N=23, rho = 0, lower.tail = FALSE, log.p = FALSE) ~ q )
# q
# 0.025 0.05 0.075 0.1 0.15 0.2
# 0.4130710 0.3514298 0.3099236 0.2773518 0.2258566 0.1842217
We can calculate p-values as usual too…
1-pPearson(.41307,N=23,rho=0)
# [1] 0.0250003
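As a cross-check that needs no extra package: when rho = 0 and the data are bivariate normal, the statistic $$ t = r\sqrt{(N-2)/(1-r^2)} $$ follows a t distribution with N-2 degrees of freedom. The sketch below inverts that relation to get critical values of r; the variable names are made up for illustration, and the results should agree closely, though not necessarily to every digit, with the table above.

```r
# Critical values of r under rho = 0 via the t distribution (base R only)
N <- 23
q <- c(.025, .05, .075, .1, .15, .2)

t_crit <- qt(q, df = N - 2, lower.tail = FALSE)   # upper-tail t quantiles
r_crit <- t_crit / sqrt(N - 2 + t_crit^2)         # invert t = r * sqrt((N-2)/(1-r^2))
round(r_crit, 4)

# Upper-tail p-value for an observed correlation
r_obs <- 0.41307
t_obs <- r_obs * sqrt((N - 2) / (1 - r_obs^2))
p_upper <- pt(t_obs, df = N - 2, lower.tail = FALSE)
p_upper
```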
One of the most common exercises given in Statistical Computing, Simulation or related classes is the generation of random numbers from a gamma distribution. At first this might seem straightforward, thanks to the lifesaving relation between exponential and gamma random variables. So, it’s easy to get a gamma random variate using the fact that
$$ X_{1},\dots ,X_{k}\sim Exp(\lambda )\ \text{independent}\Rightarrow \sum\limits_{i=1}^{k}{X_{i}}\sim Ga(k,\lambda )$$.
The code to do this is the following:
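rexp1 <- function(lambda, n) {
  # inverse-transform sampling: -log(U)/lambda is Exp(lambda) for U ~ Uniform(0,1)
  u <- runif(n)
  -log(u)/lambda
}
rgamma1 <- function(k, lambda) {
  # one Ga(k, lambda) draw as the sum of k independent Exp(lambda) draws
  sum(rexp1(lambda, k))
}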
Unfortunately, this works only for the case $$ k\in \mathbb{N}$$.
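For integer k, a quick sanity check is to compare the sample mean and variance of many draws with the theoretical values k/λ and k/λ². The sketch below repeats the two functions so that it runs on its own; the choices k = 3, λ = 2 and 10000 replications are arbitrary.

```r
# Sum-of-exponentials gamma generator, repeated here so the check is self-contained
rexp1 <- function(lambda, n) -log(runif(n)) / lambda
rgamma1 <- function(k, lambda) sum(rexp1(lambda, k))

set.seed(123)
k <- 3
lambda <- 2
draws <- replicate(10000, rgamma1(k, lambda))

# Sample moments next to the theoretical ones
rbind(sample = c(mean = mean(draws), var = var(draws)),
      theory = c(mean = k / lambda, var = k / lambda^2))
```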
It’s π-day today, so we are going to have a little fun with Buffon’s needle and, of course, R. A well-known approximation to the value of $$\pi$$ comes from the experiment that Buffon performed using a needle of length $$l$$. All I do next is copy the function estPi from the following file and use an ergodic sample plot… Lame, huh?
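estPi <- function(n, l=1, t=2) {
  # n needles of length l dropped on parallel lines spaced t apart (requires l <= t)
  m <- 0                                  # number of needles crossing a line
  for (i in 1:n) {
    x <- runif(1, min=0, max=t/2)         # distance from needle centre to nearest line
    theta <- runif(1, min=0, max=pi/2)    # acute angle between needle and the lines
    if (x < l/2 * sin(theta)) {
      m <- m + 1
    }
  }
  # P(cross) = 2*l/(t*pi), so pi is estimated by 2*l*n/(t*m)
  return(2*l*n/(t*m))
}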
So, an estimate would be…
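One way to get both a single estimate and the ergodic (running) plot is to vectorise the experiment and track the estimate as needles accumulate. This is only a sketch under the same setup as estPi (needle length l = 1, line spacing t = 2); the seed, the number of needles and the variable names are arbitrary choices made for illustration.

```r
# Running (ergodic) estimate of pi from Buffon's needle, vectorised
set.seed(314)
n <- 10000                                # number of needles
l <- 1                                    # needle length
t <- 2                                    # spacing between the lines

x <- runif(n, min = 0, max = t / 2)       # distances from needle centres to nearest line
theta <- runif(n, min = 0, max = pi / 2)  # acute angles between needles and lines
hits <- x < l / 2 * sin(theta)            # TRUE when a needle crosses a line

m <- cumsum(hits)                         # crossings among the first 1..n needles
running <- 2 * l * seq_len(n) / (t * m)   # estimate after each needle
running[!is.finite(running)] <- NA        # undefined until the first crossing

running[n]                                # final estimate

plot(seq_len(n), running, type = "l", ylim = c(2.5, 4),
     xlab = "number of needles", ylab = "estimate of pi")
abline(h = pi, lty = 2, col = "red")
```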
