May 7, 2010

Knowing whether a time-series has been differenced appropriately in order to make it stationary

Hello everybody,

Today I would like to show you a simple method (using R, of course) for identifying whether a time-series has been differenced appropriately when making it stationary.

Suppose you have made a series stationary by differencing it. To check that it is neither over- nor under-differenced, fit both the current series and its next-level (once more) differenced version, using either a regression or an ARIMA model with a constant/intercept. From the two fits, compare either the Akaike Information Criterion (AIC) or the root-mean-squared error (RMSE): if the AIC (or RMSE) of the current series is lower than that of the further-differenced series, you can conclude that the current series has been differenced appropriately.

In R, this can be done with the following two commands (where x stands for the series being tested):
arima(x, order = c(0, 0, 0))  # current series, fitted with a constant/mean
arima(x, order = c(0, 1, 0))  # the same series differenced once more
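
A minimal worked sketch of the comparison, using a simulated random walk (the names x, dx, fit0 and fit1 are illustrative, not from the original post):

set.seed(123)
x  <- cumsum(rnorm(200))               # random walk: one difference makes it stationary
dx <- diff(x)                          # candidate appropriately-differenced series
fit0 <- arima(dx, order = c(0, 0, 0))  # current (once-differenced) series
fit1 <- arima(dx, order = c(0, 1, 0))  # differenced once more
AIC(fit0)                              # if this is lower than AIC(fit1), dx is
AIC(fit1)                              # adequate; differencing again would overdo it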

In SAS, the same can be checked using the IACF (inverse autocorrelation function) plots produced by PROC ARIMA.
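
As an aside (my own suggestion, not from the original post), the forecast package in R provides ndiffs(), which estimates the required differencing order via unit-root tests and makes a quick cross-check:

library(forecast)  # assumes the forecast package is installed
ndiffs(x)          # for a random walk like x above, this should return 1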

Regards,
besteconometrician@gmail.com

Feb 10, 2010

Easy way of determining number of lines/records in a given large file using R

Dear Readers,

Today I would like to post an easy way of determining the number of lines/records in any given large file using R.

Straight to the point.

1) If the data set is small, say less than about 50 MB, one can read it in R with ease using:
length(readLines("xyzfile.csv"))

2) But if the data set is large, say more than 1 GB, reading it this way hits R's memory limit, since readLines pulls all the records into memory before the count is computed.

3) So, how does one determine the number of lines in a large data set without running into memory problems?

a) First, for a file of about half a GB, or roughly one million records/observations (assuming your PC has 2 GB of RAM), the code below determines the number of records with no memory-related errors by reading the file in chunks:

testcon <- file("xyzfile.csv", open = "r")
readsizeof <- 20000                # number of lines to read per chunk
nooflines <- 0
while ((linesread <- length(readLines(testcon, readsizeof))) > 0)
  nooflines <- nooflines + linesread
close(testcon)
nooflines

b) Next, even for files larger than half a GB, one can determine the number of records by bzipping the file first and running essentially the same code:
testcon <- file("xyzfile.csv.bz2", open = "r")
readsizeof <- 20000                # number of lines to read per chunk
nooflines <- 0
while ((linesread <- length(readLines(testcon, readsizeof))) > 0)
  nooflines <- nooflines + linesread
close(testcon)
nooflines
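
For reuse, one might wrap this loop in a small helper function (a sketch only; the name count_lines and the chunk-size default are mine, not from the original post):

count_lines <- function(path, chunk = 20000) {
  con <- file(path, open = "r")
  on.exit(close(con))              # close the connection even if an error occurs
  n <- 0
  while ((got <- length(readLines(con, chunk))) > 0)
    n <- n + got
  n
}

count_lines("xyzfile.csv.bz2")     # works for plain and compressed files alike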

The second method has the added advantage of disk-space efficiency, since R (from version 2.10 onwards) can read compressed files directly.

I hope readers will find these easy methods useful from now on.

Have a nice time programming with R. The author can be reached at mavuluri.pradeep@gmail.com.