Channel: Noodling in the data stream

What I’m up to


This will be a short post, just to keep me in the habit.

I found two new blogs I want you to know about: Normal Deviate and Econometrics by Simulation.

The week after next, PolMeth XXIX is in town, and Gary King is speaking on Friday, July 20, about robust standard errors. I'm going.

I booked my spot at Stata Conference 2012 and I have a stack of books in mind for the participant discount.

I finished the Workflow book. Chapter 5 will change my life, I can tell.

Anybody writing the book version of the Mata Missing Manual yet? I volunteer to read galley proofs and test code.


Pushing circles around


Occasionally, the need comes up to draw a circle. Say you have a scattershot of points that follow a bivariate normal distribution, and you want to illustrate which of them fall within a given radius of some point. You will want to draw a circle, both to guide the eye and so you can use its functional form to isolate the dots that fall inside it.

Nick Cox already explained how to draw a unit circle here. The code below shows how you can start from there and draw any other circle.

clear all
set more off
pause on

// A circle starts like this:
// 1. unit circle: x^2+y^2=1
// 2. circle of radius r centered in origin: x^2+y^2=r^2
// 3. circle of radius r centered at (a, b): (x-a)^2+(y-b)^2=r^2

// OK, so 3. is the generic form. Use it to describe y:
// y-b= [+/-]sqrt(r^2-(x-a)^2)
// y  =b[+/-]sqrt(r^2-(x-a)^2)

// Now you can draw circles, one half at a time
// as Nick Cox explained it in the Stata Journal,
// using the twoway function graph:
// http://www.stata-journal.com/sjpdf.html?articlenum=gr0010

// If you want more than one circle in a picture,
// it's worth doing some setting up first.

// E.g., formula components with values to be
// filled in later can be written once here:

// 1. the square-root part
local sqrt    sqrt(\`r'^2-(x-\`a')^2)

// 2. functions for the top and bottom semicircles 
local topf    function y=\`b'+\`sqrt', range(\`rl' \`rr') \`c'
local bottomf function y=\`b'-\`sqrt', range(\`rl' \`rr') \`c'

// An example of a circle drawn in the 1st quadrant
local a=1
local b=1
local r=.5
local rl=`a'-`r' // x [r]ange [l]eft end
local rr=`a'+`r' // x [r]ange [r]ight end
local c color(red)
local circle1 (`topf' aspect(1)) (`bottomf')

// An example of a circle drawn in the 3rd quadrant
local a=-1
local b=-1
local r=.5
local rl=`a'-`r'
local rr=`a'+`r'
local c color(blue)
local circle2 (`topf') (`bottomf')

// Reference case: the unit circle
local a=0
local b=0
local r=1
local rl=`a'-`r'
local rr=`a'+`r'
local c color(black)
local circleu (`topf') (`bottomf')

// Might as well string them all together
local circles `circle1' `circle2' `circleu'

// Draw the picture
twoway `circles', legend(off)

The R library folder on a Mac


How do you update your R? I mean the whole thing, not just the base packages.

R comes with a library folder where packages go -- both the base packages and the user-written ones you install over time. I'm not entirely sure where this folder is best kept so that updates to all packages are as easy as possible.

It gets worse. On a Mac you have not one, but n+2 library folders. One is /Library/Frameworks/R.framework/Resources/library. Let's call it (1). The other n+1 are as many as there are historic versions of R on your computer, plus one. All are in /Library/Frameworks/R.framework/Versions/. In my case, they are as follows: n of them are in X/Resources/library, where X is 2.13, 2.14, and 2.15; one is in Current/Resources/library.

I know that the three library folders in X=2.15, Current and (1) are identical and updated simultaneously, though only the first shows up when I call .libPaths(). But when you update, say, from 2.15 to 2.16, your historic versions remain frozen and the new (1) and Current folders reflect only 2.16. You have to install non-base packages all over again, by hand. Or do you? Is there a way to script an update of everything in the library folder at version t-1 when you update your R to version t? If anybody knows, please drop a comment. Thank you.
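Lacking a better answer, here is a first cut at what such a script could look like (a sketch; the old-version path is the one from my own machine, so adjust it to whatever version you are leaving behind):

```r
# Sketch: reinstall in the new R everything found in the old
# version's library that the new R doesn't already have.
oldLib <- "/Library/Frameworks/R.framework/Versions/2.15/Resources/library"
have <- rownames(installed.packages())                  # new R's packages
want <- rownames(installed.packages(lib.loc = oldLib))  # old R's packages
install.packages(setdiff(want, have))
```

This reinstalls from CRAN rather than copying folders over, which is slower but gets you builds that match the new R version.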

My MOOC habit


I can't seem to stay away from all this great online education, and the problem worsens as supply expands and prices fall. Back in 2008 I was paying NC State about $400 a pop for some CS courses that I took online there. Now there's Coursera, edX, Udacity, P2PU and Caltech, all free.

For some of us, it's a growing habit, as confessed in the last paragraph here. But scattered evidence so far suggests that a majority of MOOC consumers are dabblers -- a bit like President Obama, who may have done a bit of blow but didn't quite become a coke head. Below is a picture that supports this view.

The Learning From Data class at Caltech consists of 18 lectures, offered both over YouTube and iTunes U. I used both: if I knew I would be watching on a plane or in a hotel room, I downloaded them ahead of time to iTunes; otherwise, I just watched them on YouTube. You can count the views for each video, and of course they will change over time, but I think the general trend is right: a lot of people watch the first lecture, then interest peters out, with some peaks that may well be random; my guess, though, is that they mark specific topics visited by people who use this class as another reference.

My own experience may be typical. I have taken at least two online courses per year since the fall of 2008, but I signed up for more than that: there are courses I signed up for and never started, and courses I started and never finished. What can I say? It's a feast out there. I am grateful to the hosts, and I hope that they will get lots of good guests with the necessary youth and stamina. I mean, if I were an unemployed millennial, I'd occupy this.

Benchmarks


I went googling for some examples of quadratic programming done in Mata, and stumbled across a fairly recent Statalist discussion. The original question is here and the official response, typically prompt, is here. I tested Patrick Roland's code on my own machine (2011 MacBook Pro Core2 i5) but with Octave instead of MATLAB, and with R in addition. Octave took about 2 seconds. My R code is


system.time(chol2inv(matrix(rnorm(2000^2),2000,2000)))

This took about 4 seconds to run, whether in RStudio or in command-line R 2.15.1. Mata, meanwhile, still takes about 30 seconds. I run Stata 12 MP, all up to date. I'd be curious how SAS/IML does, but I don't have it.
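One wrinkle with the one-liner above: chol2inv() expects the upper-triangular factor that chol() returns, not a raw matrix, so feeding it random noise times the arithmetic without producing a meaningful inverse. A benchmark closer in spirit to inverting a symmetric matrix might look like this (a sketch):

```r
# Time a genuine inversion of a symmetric positive-definite matrix:
# factor with chol(), then invert the factor with chol2inv().
set.seed(1)
X <- matrix(rnorm(2000^2), 2000, 2000)
S <- crossprod(X)  # X'X is symmetric positive definite
system.time(Sinv <- chol2inv(chol(S)))
```

The timings should be in the same ballpark either way, since the flop counts are similar.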

Setting up my R library folder on a Mac


My understanding is that there are three kinds of R packages: base, recommended, and everything else. You can tell which is which by inspecting the output of installed.packages(). That is easiest done in RStudio by sending that output to a data frame, like this


packs <- as.data.frame(installed.packages())

You can see that this data frame has a column named Priority. The output of table(packs$Priority, exclude=NULL) shows that I have 14 base packages, 15 recommended ones, and 70 of the other kind -- user-contributed kit that I installed over time as I bumbled my way through learning and using R.

Looking at packs in the top left pane of RStudio also shows that the rows are named after the packages. This means that you can collect the names of base and recommended packages easily:


> rownames(subset(packs, Priority=="recommended"))
 [1] "boot"       "class"      "cluster"    "codetools"  "foreign"    "KernSmooth"
 [7] "lattice"    "MASS"       "Matrix"     "mgcv"       "nlme"       "nnet"      
[13] "rpart"      "spatial"    "survival" 

Having all the R packages in the same default library, which is /Library/Frameworks/R.framework/Versions/2.15/Resources/library as of this writing, comes with the disadvantage that when I upgrade to the next version of R I will have to re-install the 70 packages of the third kind.

It would be nice if I could set them aside in a different library that any future version of R will know to look in, updating them as needed.

There are two steps to this job: one is the actual moving of package folders; the other is to show R where to look for them.

First, I created a new folder called Rlibs. Then I moved the folders around with this Bash script, which I called movePacks.sh:


#!/bin/bash

# declare an array with the names of the base packages
basepacks=("base" "compiler" "datasets" "graphics" "grDevices" "grid" "methods" "parallel" "splines" "stats" "stats4" "tcltk" "tools" "utils")

# and another with the names of the recommended ones
recpacks=("boot" "class" "cluster" "codetools" "foreign" "KernSmooth" "lattice" "MASS" "Matrix" "mgcv" "nlme" "nnet" "rpart" "spatial" "survival")

# and now concatenate them
allpacks=("${basepacks[@]}" "${recpacks[@]}")

# where you're moving from
oldLib="/Library/Frameworks/R.framework/Versions/2.15/Resources/library"

# where you're moving to
newLib="/Users/ghuiber/Rlibs"

# first, move everything over
mv "${oldLib}"/* "${newLib}"

# then move the base and recommended packages back to their default location
for i in "${allpacks[@]}"
do
   mv "${newLib}/${i}" "${oldLib}"
done

Finally, to point R to the library folders, I created this .Renviron file as instructed by Christophe Lalanne in the comments to my earlier post on the topic:


R_PAPERSIZE=letter
R_LIBS=/Users/ghuiber/Rlibs
EDITOR=vim

The ideas for the R_PAPERSIZE and EDITOR environment variables came from here.
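With the .Renviron file in place, a restarted R session should search the new library first; calling .libPaths() confirms the order:

```r
.libPaths()
# expected: the R_LIBS location first, then the version-specific
# default library, e.g.
# [1] "/Users/ghuiber/Rlibs"
# [2] "/Library/Frameworks/R.framework/Versions/2.15/Resources/library"
```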

I’m taking intro to biostats and epi (PH207x) from EdX


So far, it's been great fun. It's the first MOOC I saw where the software used is Stata, and I would not be surprised if this were a first among all commercial software packages. The topics covered and quality of the instruction are excellent. I am glad to see Stata introduced to a large audience in such nice company. StataCorp made free temporary licenses available to all registered students worldwide for the duration of the course.

But enough about Stata. What really blew me away was the textbook for the biostats section. Beautifully written, it takes its time to cover properly everything you need to know about hypothesis testing if you're an applied researcher. Most texts I've seen before hurried through this part like they couldn't wait to jump into regression diagnostics and the like. Maybe it's because I've only seen econometrics texts before.

Anyway, buy it if you're looking for an introductory text for applied stats of any kind, and biostats in particular. I'm not getting a penny for this plug, by the way, so feel free to try and find it for less elsewhere. And take the course when they offer it again. At the very least, it will make you an educated consumer of public health information.

Tidying up your R packages


Do you have the same R packages installed in two places? Would you like to remove the duplicates? You might find the script below useful:


rm(list=ls(all=TRUE))

# define function to return duplicate packages and paths
tidyup <- function() {
  p <- as.data.frame(installed.packages(), stringsAsFactors = FALSE)
  # rows whose Package name appears under more than one LibPath
  p[p$Package %in% p$Package[duplicated(p$Package)], c("Package", "LibPath")]
}

Why I wrote this:

A while back I chose to separate my package library over two file paths. One would be for base and recommended packages (1), the other for everything else (2). My notes on how I did that are here, and my reasons are here.

Today, I wanted to update my Zelig. I used the wizard -- source("http://r.iq.harvard.edu/zelig.installer.R") -- so I would get all the add-ons in one step. The wizard works under the assumption that your library is all in one place. Because it didn't find them on path (2), it installed there a few packages that Zelig and its add-ons depend on, even though they were already present on path (1). So I ended up with duplicates. This is how I got rid of them.


An R-squared for logistic regression, packaged


This morning I checked Paul Allison's Statistical Horizons blog and found a post on R^2 measures for logistic regression. It introduced me to Tjur's R^2 by way of an example, which I repackaged below:


// Reference: http://www.statisticalhorizons.com/r2logistic

// program definition
capture prog drop tjur2
program tjur2, rclass

if !inlist(e(cmd),"logit","logistic") {
   di as err "Tjur's R-squared only works after logit or logistic."
   exit 468 // Thank you, Nick Cox.
}
tempvar yhat
predict `yhat' if e(sample)
local y `e(depvar)'
quietly ttest `yhat', by(`y')
local r2logistic = r(mu_2)-r(mu_1)
di "Tjur's R-squared " _col(20) %4.3f `r2logistic'
return local r2logistic `r2logistic'

end

// use case
use "http://www.uam.es/personal_pdi/economicas/rsmanga/docs/mroz.dta", clear
logistic inlf kidslt6 age educ huswage city exper
tjur2

I'm not sure yet if it's worth saving this program as ado/personal/t/tjur2.ado for my future logistic regression diagnostic needs, but I haven't posted anything Stata-related in too long, so there you have it.
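For comparison, the same statistic is nearly a one-liner in R: Tjur's R-squared is just the difference in mean fitted probability between the observed successes and failures. A sketch on a built-in dataset:

```r
# Tjur's R-squared after a logistic regression: difference in mean
# fitted probability between observed 1s and 0s
fit <- glm(am ~ mpg, data = mtcars, family = binomial)
p <- fitted(fit)
tjur2 <- mean(p[mtcars$am == 1]) - mean(p[mtcars$am == 0])
tjur2
```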

A quick note on rJava


I recently had to set up a PC with kit similar to what I have on my Mac. On this PC the OS is Windows 7 64-bit, but the browser is IE8 32-bit. This causes jucheck.exe to install (and occasionally update) 32-bit Java. That is unfortunate if you use 64-bit R, because it breaks the rJava package, which in turn breaks the xlsx package, with the practical consequence that you cannot read Excel worksheets into R. There is a workaround.

First, install Oracle's manual download of 64-bit Java. As of this writing, its Windows 7 home will be in C:\Program Files\Java\jre7. You should add this to the %path% environment variable. In addition, the rJava package depends on jvm.dll, and R might be looking for it in the wrong spot. It won't hurt, then, to add this to your %path% as well: C:\Program Files\Java\jre7\bin\server. There's more on this, as usual, on StackOverflow.
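If rJava still looks in the wrong place after the %path% edits, you can also point it at the right JRE from within R, before the package loads (the path below is the example from above; match it to your installed version):

```r
# Tell rJava where the 64-bit JRE lives before loading it
Sys.setenv(JAVA_HOME = "C:/Program Files/Java/jre7")
library(rJava)
library(xlsx)  # should now load without complaint
```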

As Oracle warns, your manually-installed 64-bit Java will not be automatically updated. That is a problem when security flaws hit Java, but I find being able to read Excel files into R so useful that I'm willing to just live with this risk, though I don't have a good idea of how best to manage it. I'll just keep an eye on ArsTechnica for bug news. If anybody has a better way, I'm all ears.

Keeping knitr happy after upgrading to R 3.0.0


As noted here, after upgrading to R 3.0.0 you must run


update.packages(checkBuilt=TRUE)

This is because a bunch of packages have to be rebuilt under R 3.0.0 in order to keep working.

So I did, but that was not enough for LyX to compile my pdf's from knitr like it used to only a week ago. What I had to do besides was this:


remove.packages("tikzDevice")
install.packages("/Users/ghuiber/Downloads/tikzDevice_0.6.3.tar.gz", repos = NULL, type="source")

That is right. The package tikzDevice can no longer be installed directly from R-forge as a binary, as in

install.packages("tikzDevice", repos="http://R-Forge.R-project.org")


Also, the source files are only available as a .tar.gz archive. To install from it on a Windows machine, you must have Rtools installed first.

Stata 13 is coming on June 24


The yellow color scheme is out, sky blue is in, plus expanded capabilities, as one might expect. Notable among them: xtologit, xtoprobit, and long strings -- 2 billion characters long, that is. One of these days you won't need an RDBMS anymore. Wouldn't that be nice?

See more details here.

I put up my first post on RPubs


Sure, it may be the 4chan of data analysis, but it's so nice to be able to do R Markdown right there in RStudio and just hit the Publish button.

Of course, this convenience has downsides. I know it's prudent to sit with your work a bit, just like thinking carefully before you go skinny-dipping, especially when you don't have the benefit of peer review.

On the other hand, there's no use waiting until nobody cares anymore. So, here goes.

How I backed up a bunch of old pictures to Amazon Glacier


This is from a home server that runs Fedora 14, to which I have ssh access from my MacBook Pro.

1. I git clone'd this.

2. Then, as super-user, I called


wget https://bitbucket.org/pypa/setuptools/raw/bootstrap/ez_setup.py -O - | python

as instructed here, to install the setuptools module.

3. Then, also as super-user, I called


python setup.py install

4. At this point, it was time to fill out the .glacier-cmd configuration file, as shown in the README.md.

5. Bookkeeping using Amazon SimpleDB requires setting up an Amazon SimpleDB domain (= database) first. You cannot do this through the AWS Management Console.

6. So I googled, and found official directions here.

7. Unfortunately, my Chrome wouldn't render the SimpleDB Scratchpad web app properly. That caused some unnecessary confusion. The solution was to just run Scratchpad in Safari.

8. Your computer has folders and files. Amazon Glacier has vaults and archives. One archive = one upload. This can be an individual file, but it's more practical to bundle individual files into tarballs first, so one archive = one tarball.

9. I'm in business: two large tarballs uploaded and showing up in my SimpleDB domain that keeps tabs on this particular vault, with one more on the way.

It looks like everything works, but I can't be sure until Amazon Glacier gets around to producing an inventory (this happens about once a day, it seems). I can then check SHA sums between what's on Glacier and what I thought I sent there. Next I will upload something small, then download it the next day.

Glacier is the digital equivalent of self-storage. You put stuff there that you don't really want anymore; you think you might, but you don't. It's a problem that comes with ease of acquiring such stuff in the first place. I don't think there's a big self-storage industry in Zambia, and I'm sure that storing old photos wasn't much of a problem back when you had to take them on film and you only had 36 frames in a roll.

I have no idea why we bother with digital self-storage. I guess simply deleting old pictures and a bunch of music we no longer listen to makes us feel like jerks. It's a total trap.

Invisible methods


R objects come with various methods that make them useful. I tend to stumble over these by googling something I want to do, and finding some code example on StackOverflow. But today I learned (from @RLangTip) that there is a straightforward way to list them all: you simply call, e.g., methods(class='lm').

That's nice, but mileage varies and I don't have a good explanation for it. Take Zelig for example. It has this sim() function which produces a simulation object with some methods of its own. One of these is plot.ci(), illustrated here. Unfortunately, you won't find it with the methods() call:


> library("Zelig", lib.loc="C:/Program Files/R/library")
Loading required package: boot
Loading required package: MASS
Loading required package: sandwich
ZELIG (Versions 4.2-2, built: 2013-10-22)

+----------------------------------------------------------------+
|  Please refer to http://gking.harvard.edu/zelig for full       |
|  documentation or help.zelig() for help with commands and      |
|  models support by Zelig.                                      |
|                                                                |
|  Zelig project citations:                                      |
|    Kosuke Imai, Gary King, and Olivia Lau.  (2009).            |
|    ``Zelig: Everyone's Statistical Software,''                 |
|    http://gking.harvard.edu/zelig                              |
|   and                                                          |
|    Kosuke Imai, Gary King, and Olivia Lau. (2008).             |
|    ``Toward A Common Framework for Statistical Analysis        |
|    and Development,'' Journal of Computational and             |
|    Graphical Statistics, Vol. 17, No. 4 (December)             |
|    pp. 892-913.                                                |
|                                                                |
|   To cite individual Zelig models, please use the citation     |
|   format printed with each model run and in the documentation. |
+----------------------------------------------------------------+



Attaching package: ‘Zelig’

The following object is masked from ‘package:utils’:

    cite

> methods(class='sim')
[1] plot.sim*   print.sim*   repl.sim*   simulation.matrix.sim*
[5] summary.sim           

   Non-visible functions are asterisked

See that? There's a non-visible plot() method listed, but no plot.ci() method, yet it exists and it works. I wonder why that is. Is it maybe that plot.ci() is some kind of child of plot()? If so, how do you list such children?
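One likely explanation: S3 dispatch is name-based, so methods(class='sim') lists only functions named generic.sim. A function named plot.ci parses as a plot method for class "ci", not for class "sim", so it will never show up under 'sim' no matter how visible it is. Either way, you can track down such a function directly with getAnywhere(), which searches all loaded namespaces and S3 method tables, exported or not:

```r
# getAnywhere() finds a name wherever it lives, visible or not
library(Zelig)          # assuming Zelig is installed, as above
getAnywhere("plot.ci")  # prints the function and its namespace
```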


FreeNAS works as advertised


I decided to replace the HDD with an SSD in my Mac for Christmas, but I only got as far as buying the thing and backing up the computer using Time Machine, as explained here, to a poor man's FreeNAS server that I cobbled together from a USB stick (for the OS) and the old Fedora 14 home server, whose sole 500G HDD is now one big ZFS volume, with 2G of RAM.

That's right, ZFS on one HDD with 2G of RAM. I'm not saying that this is a good setup. The official hardware recommendation is 8G for ZFS. But this is the kit I had lying around, and I just wanted to move on with the actual disk replacement; my D510MO board won't even support more than 4G of RAM (though I'm not sure why, since it was made to accommodate a 64-bit CPU). Anyway, I managed to make one first complete Time Machine backup and a few incremental ones before leaving for work on Monday, January 6.

I flew back on Thursday, January 9, and found a non-responsive Mac with a HDD so sick that an erase-and-install OS restore was in order. That's what you end up having to do when, upon entering your password at boot-up, you see the apple logo for a while, then that "prohibited access" barred circle, while the gear animation spins and spins.

I have no idea how this happened. I felt very fortunate for having made that backup. I decided that the accident was a good excuse to just proceed with the SSD installation already.

The proof of this particular pudding was going to be in restoring the old system from that Time Machine backup, over the LAN, off the grossly inadequate NAS box. I am happy to report that the restore succeeded, and my Mac is back in business, now with an SSD.

What I'm saying is this: if you don't have a Time Capsule but do have some idle hardware, FreeNAS may be a good Time Machine backup solution for you too.

One thing you will want to know about is user quotas: a 500G NAS HDD will fill up quickly if you let Time Machine have its way with it. The solution is to set some reasonable user quotas for people in your house who might use the FreeNAS box as their Time Machine backup destination. You can do that from the web GUI. The Advanced Mode of the Create ZFS Dataset menu under Storage (or, for an existing dataset, the Advanced Mode of Edit ZFS Options) lets you set quotas four different ways; for specifics, google thin and thick provisioning. This seems to be advanced sysadmin stuff.

There is also a command-line recipe for setting user quotas here. You get to the FreeNAS shell from the web GUI: look at the bottom of the vertical navigation menu on the left.

Smaller quotas will force Time Machine to keep a shorter history. It deletes old backups as it runs out of space -- so, less room, shorter history. That is not a bad thing.

MacBook Pro running hot, draining battery after upgrading to an SSD?


Mine did. That was an unpleasant surprise. Googling for a solution brought up untold amounts of speculation and wasted time.

What ended up working for me was resetting the System Management Controller (SMC), as documented here and especially here. You should see that Reddit comment thread, especially if you're also wondering whether you're supposed to enable TRIM.

Resetting the SMC brought down the CPU core temperatures from about 90°C to about 60°C, low enough for the fan to not kick in. My Mac is once again as quiet as it used to be.

Some unresolved hiccups with R 3.1.0 on Mavericks, and a workaround


If you're going to download the Mac binaries for the latest R, you will see that they come in "Snow Leopard and higher" and "Mavericks and higher" flavors. If you run Mavericks, the latter is a natural choice, though the former clearly says "and higher" too, so it's got to be a valid option as well.

As it turns out, it's the better option, at least as of this writing.

The Mavericks build crashes with a segmentation fault upon attempting to load either the caret or data.table library, as reported here and here. A brief search through the R-SIG-Mac Archives returned no useful leads for fixing the problem.

Dropping the Mavericks build and installing the Snow Leopard one gave me back both caret and data.table. This works for me.


> sessionInfo()
R version 3.1.0 (2014-04-10)
Platform: x86_64-apple-darwin10.8.0 (64-bit)

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] RCurl_1.95-4.1   bitops_1.0-6     scales_0.2.4     ggplot2_1.0.0    reshape_0.8.5    data.table_1.9.2
[7] MASS_7.3-31     

loaded via a namespace (and not attached):
 [1] colorspace_1.2-4 digest_0.6.4     grid_3.1.0       gtable_0.1.2     htmltools_0.2.4  munsell_0.4.2   
 [7] plyr_1.8.1       proto_0.3-10     Rcpp_0.11.2      reshape2_1.4     rmarkdown_0.2.46 stringr_0.6.2   
[13] tools_3.1.0  

Introducing syncR


A new and improved version of the syncPacks() function is now part of a GitHub package, which you can install through devtools::install_github('ghuiber/syncR'). If you're into that, you can help develop it further too.

Thanks go to Hilary Parker for her thorough instructions and to Hadley Wickham for devtools and roxygen2.

Here’s to MOOC’s. They’re better than textbooks


The job of textbooks is to separate brilliance, which has zero marginal cost, from individual attention, which is labor-intensive. Everybody is better off when the few brilliant teachers write books that the many dedicated ones can teach from.

MOOC's do the same job better. They are cheaper to make and distribute. They are cheaper to improve on, because student response is automatic and quite precise: all you have to do is look for videos most rewound, or quiz answers most missed. Improvements can be spliced in as needed, one four-minute video replacing another. MOOC's are also much better at avoiding bloat. Textbooks grow thicker and more colorful over time, driven by relentless yearly print runs. It is not clear how much of this reflects truly new content, more effective delivery, or the need to kill off the resale market. With MOOC's, there is no such uncertainty. The resale market is not a concern. Courses that are not watched will be abandoned. Lectures rewound a lot or whose accompanying quizzes have low pass rates will be re-shot, improved. And videos preserve the kind of author's flair for delivery that is lost on the printed page no matter how colorful the latest version is, or how interactive the accompanying website.

Many trees are felled for making textbooks that are returned to the publisher. While they are out, they are clutter that makes it hard to find the good ones: they're all equally thick and colorful and pushed by equally enthusiastic reps. MOOC's, on the other hand, produce all kinds of vital statistics -- viewership, attrition rate, forum participation, topics most discussed, etc. -- as soon as they go live. They are easy to kill off if they don't catch on and it does not take long to know whether they might. MOOC's may look like a monoculture, but what looks like diversity in textbooks is just market inefficiency.

MOOC's don't work that well on their own for the same reason that textbooks don't: both are complements to individual attention, not substitutes for it. But MOOC's paired with a flipped classroom will do a better job than textbooks paired with a reading schedule have done so far. Thanks to them the workers of the future will be more productive than we are.
