Noodling in the data stream

Introducing syncR


A new and improved version of the syncPacks() function is now part of a GitHub package, which you can install through devtools::install_github('ghuiber/syncR'). If you're into that, you can help develop it further too.
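For the record, here is the whole thing in three lines; the no-argument syncPacks() call is an assumption on my part, so check the package help for the actual signature:

devtools::install_github('ghuiber/syncR')   # install from GitHub
library(syncR)
syncPacks()   # assumed no-argument call; see ?syncPacks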

Thanks go to Hilary Parker for her thorough instructions and to Hadley Wickham for devtools and roxygen2.


Here’s to MOOCs. They’re better than textbooks


The job of textbooks is to separate brilliance, which has zero marginal cost, from individual attention, which is labor-intensive. Everybody is better off when the few brilliant teachers write books that the many dedicated ones can teach from.

MOOCs do the same job better. They are cheaper to make and distribute. They are cheaper to improve on, because student response is automatic and quite precise: all you have to do is look for the videos most rewound, or the quiz answers most missed. Improvements can be spliced in as needed, one four-minute video replacing another. MOOCs are also much better at avoiding bloat. Textbooks grow thicker and more colorful over time, driven by relentless yearly print runs. It is not clear how much of this reflects truly new content, more effective delivery, or the need to kill off the resale market. With MOOCs there is no such uncertainty. The resale market is not a concern. Courses that are not watched will be abandoned. Lectures that get rewound a lot, or whose accompanying quizzes have low pass rates, will be re-shot and improved. And video preserves the kind of flair for delivery that is lost on the printed page, no matter how colorful the latest edition is or how interactive the accompanying website.

Many trees are felled to make textbooks that end up returned to the publisher. While they are out, they are clutter that makes it hard to find the good ones: they're all equally thick and colorful, and pushed by equally enthusiastic reps. MOOCs, on the other hand, produce all kinds of vital statistics -- viewership, attrition rate, forum participation, topics most discussed, etc. -- as soon as they go live. They are easy to kill off if they don't catch on, and it does not take long to know whether they might. MOOCs may look like a monoculture, but what looks like diversity in textbooks is just market inefficiency.

MOOCs don't work that well on their own, for the same reason that textbooks don't: both are complements to individual attention, not substitutes for it. But MOOCs paired with a flipped classroom will do a better job than textbooks paired with a reading schedule have done so far. Thanks to them, the workers of the future will be more productive than we are.

Recipe for pairing up RStudio with GitHub


On both Windows and Mac I have been happy to use RStudio for R development and the GitHub app for handling version control. The app seems to be GitHub's own preferred interface, and if you use it you don't even need Git for Windows. I'm not sure why you wouldn't just do that. The only cost is that you have to flit between RStudio and the GitHub app every time you make a commit, but how much of an interruption is that? You flit between RStudio and the browser all the time to check StackOverflow, don't you?

Regardless, suppose that you find the thought of doing your version control from inside RStudio appealing. Below are the setup steps that worked for me, pieced together from many places in the process of integrating my startUpViz repo into RStudio's Git workflow.

Step 1: Give RStudio the Git

Install Git for Windows or, if it's installed already, tell RStudio about it as explained in the five-step method described here. Make sure to stop at "Restart RStudio and that is all there is to it!" My instructions below supersede the remaining instructions there, not because those are wrong, but because I have a specific kind of R project in mind: a package.

Step 2: Create a brand new R project

Create a brand new R project in a brand new directory and check the box "Create a Git repository." You might as well place it inside c:/users/[yourusername]/documents/GitHub/ because this is probably where you keep all the work that ends up published on GitHub.

Step 3: Open the Git shell and configure your SSH key pair

In RStudio your brand new project comes with a Git tab in the top right corner, where you usually only pay attention to the Environment and History tabs. That Git tab has a gear icon, and the drop-down menu that opens when you click it has a "Shell..." option. That is your Git Bash shell. Open it, create a new SSH key pair, then send the public one to GitHub. The complete instructions for this are here, but there's one crucial twist: instead of ssh-agent -s as shown in the last screenshot at step 2 on that page, you must type eval `ssh-agent -s`. Only then can you ssh-add your new private key. Details are offered on StackOverflow. You only need to do this step once: subsequent RStudio projects that you version-control on GitHub from inside RStudio will use the same key pair for authentication.
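In command form, and assuming the default key location, Step 3 at the Git Bash prompt looks roughly like this (the email address is a placeholder):

ssh-keygen -t rsa -C "you@example.com"   # accept the default file location when prompted
eval `ssh-agent -s`                      # note the eval and the backticks
ssh-add ~/.ssh/id_rsa                    # add the new private key to the agent
cat ~/.ssh/id_rsa.pub                    # paste this public key into your GitHub account's SSH settings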

Step 4: Create a brand new GitHub repo at github.com

Now go to GitHub and create a bare-bones new repo with the same name as your directory from Step 2 above. Do not check the box "initialize this repo with a README file": that part can wait, and checking it would bypass some options you'll want. When you create this new repo, you will be given the options
- Quick setup — if you've done this kind of thing before
- …or create a new repository on the command line
- …or push an existing repository from the command line
The third is the one you want. Now go back to RStudio, but before you leave, notice that although the "clone URL" you can copy to the clipboard says https:// by default, that is not your only option. If you read the hint that "You can clone with..." and click SSH, the URL changes to something that starts with git@github.com:. That's the one you'll want. Save it to the clipboard now.

Step 5: Add a remote repo

Now you're back in RStudio. If at the Git BASH shell prompt you type git remote -v and hit Enter, you should see nothing, because you don't yet have a remote repo. You add one with git remote add origin git@github.com:[yourusername]/[yourrepo].git where the part after the word origin comes from what you saved to the clipboard at Step 4 above.
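In command form, Step 5 looks like this:

git remote -v                                                     # should print nothing yet
git remote add origin git@github.com:[yourusername]/[yourrepo].git
git remote -v                                                     # now shows origin, fetch and push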

Conclusion

Once you complete the 5 steps above, you can git push -u origin master for the first push, and then commit, push, pull, etc. directly from RStudio. Either skipping the little eval tweak from Step 3 or using the https://[...] URL for the remote repo instead of the git@[...] one will cause the SSH connection to fail, and RStudio will be unable to push to the remote.

I don't know why this had to be so hard to set up, but there it is. I wrote it down because it took trial and error. Anyway, this is how you hook RStudio for the first time to a pristine GitHub repo that has not yet had its first commit. This is the kind of repo you need for an R project that starts in a new directory, with the further option of making it an empty project, a package, or a Shiny app.

A much easier way to go (especially once you have an SSH key pair set up) is to set up your repo at github.com with the box "initialize this repo with a README file" checked. That immediately triggers your first commit. Next, you go to RStudio and start a brand new project, but this time you pick the "Version Control" option (bottom of the dialog box) instead of either of the two above it ("New Directory" or "Existing Directory" as of this writing). You then pick Git, give it the git@[...] URL you saved to the clipboard at Step 4, and you're good to go.

This, though, will be a bare-bones project, and it's up to you to fill in the goods. None of the setup work that RStudio provides for new packages or Shiny apps will be done by default. You can see why: the typical use case for a project checked out from a version control repository is picking up where you or somebody else left off earlier. There is some work in progress you'll be making use of, not just an empty repo that you happened to start with the box "initialize this repo with a README file" checked.

Either way you do it, starting afresh or checking out from an existing repo, you still have the option to revert to the GitHub app if, say, you miss the Sync button. You just need to do a push and pull from RStudio to make sure that your local copy is identical to the remote, and then let the GitHub app clone the remote onto the local one. No conflicts will be reported, and from then on you can handle the version control from either app.

Amending a hasty commit


On my current project I occasionally have to report anomalies in input data to people upstream, who can look into them. I do this with HTML files that I knit from R Markdown. They have to include prose, code, results, and pictures. As I edit them on my way to the final product, it's practical to cache some of the code chunks, especially the ones that load large input files or read from a database.

The cache folders can get quite big. If it so happens that I haven't edited .gitignore to skip cache folders, my next commit will be slow and the Sync will fail because the enterprise GitHub server has a 50M file size limit.
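For what it's worth, with the default rmarkdown/knitr naming convention the cache and figure folders are named after the .Rmd file, so a couple of wildcard lines in .gitignore cover them all (adjust the patterns if you set a custom cache.path):

# skip knitr caches and generated figure folders
*_cache/
*_files/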

But if you've already hit 'commit' and the Sync button in the GitHub app, what do you do? It turns out you can edit .gitignore so it knows better next time, maybe delete the offending files -- not that you strictly have to, but these cache folders won't be needed once the report is in its final format and they just take up disk space -- and then do this at the command line:

git status
git add --all
git commit --amend --no-edit
git status

The first git status will show that you have deleted files and modified .gitignore. The git add --all construct, unlike git add ., will stage the deletions as well as the modification. Then git commit --amend --no-edit replaces the hasty commit with a new one that also includes these changes, reusing the original commit message, as explained here. The second git status confirms that all is well, which you can see again when you switch back to the GitHub app: the Sync button tells you that you're ahead by one commit, you click it, and the push is quick, because the huge cache files are gone.

I wrote an R data frame to a Teradata table on a Mac


Here's how I did it:

  1. On a new Mac running Mavericks and R 3.1.2 with devtools, I installed Java for Mac.
  2. I installed the RJDBC package from CRAN (which depends on the DBI package also from CRAN) and the teradataR package from GitHub.
  3. I downloaded the Teradata JDBC driver, unpacked it, and moved tdgssconfig.jar and terajdbc4.jar to /System/Library/Java/Extensions.
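In R terms, step 2 was roughly the following; the GitHub handle is a placeholder, so point install_github() at whichever mirror of teradataR you trust:

install.packages(c('DBI','RJDBC'))                # RJDBC depends on DBI
devtools::install_github('SOMEUSER/teradataR')    # placeholder handle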

After that, writing the data frame foo to the table DATABASE.BAR was as simple as:

conn <- teradataR::tdConnect(dsn='datamart.mycompany.com',
                             uid='user',pwd='pass', 
                             dType='jdbc')
teradataR::tdWriteTable(databasename='DATABASE', 
                        tablename='BAR', df=foo)
teradataR::tdClose(conn)

I had to do this because DBI::dbWriteTable() now fails on Teradata as explained here.

My thanks go to Jeffrey Wong for mirroring and nurturing the no-longer-supported teradataR package, and to Skylar Lyon for finding Jeff's repo.

Regularization and the Bible


There are two kinds of regression regularization: ridge, or L2, and Lasso, or L1. Both let you get rid of some variance (less is good) in exchange for picking up some bias (less of which is also good; that's the trade-off).

How they go about it makes sense if you compare two versions of the Ecclesiastes 7:16 verse, which is good advice either way you read it, but reads a little differently in the two translations compared here.

The ESV could be read as "be righteous in all matters, but not overly so." That's L2, ridge. The KJV version might be read as "be righteous, but only about some things." That's L1, Lasso.
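If you want to see the difference in action, here's a minimal sketch with glmnet, where alpha = 0 gives ridge and alpha = 1 gives Lasso (simulated data, nothing more):

library(glmnet)
set.seed(1)
x <- matrix(rnorm(100 * 20), nrow = 100)   # 20 candidate predictors
y <- x[, 1] - 2 * x[, 2] + rnorm(100)      # only the first two matter
ridge <- glmnet(x, y, alpha = 0)           # ESV: shrink every coefficient, keep them all
lasso <- glmnet(x, y, alpha = 1)           # KJV: zero out most, keep a few
coef(ridge, s = 0.1)                       # all 20 predictors survive, shrunk toward zero
coef(lasso, s = 0.1)                       # most coefficients are exactly zero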

This all came up because I had to explain regularization recently, and because I just finished re-reading Keep the Aspidistra Flying. Ecclesiastes 7:16 makes an appearance toward the end of the book.

Getting started with the NooElec NESDR Nano on OS X Yosemite


This is how I did it. Your mileage may vary.

Prerequisites:
- Xcode with command line tools
- MacPorts
- cmake, autoconf, automake, libtool, libusb*, rtl-sdr

All the programs in the last item above came from MacPorts**, installed with sudo port install PROGRAM; libusb needed an extra flag*. The first four on the list are required for compiling the last two from source -- and in that order, because rtl-sdr depends on libusb. Unfortunately, compiling from source did not work for me: rtl-sdr kept failing to find libusb. Installing both from MacPorts did the trick. Upshot: save yourself some trouble and use MacPorts. Other upshot: since MacPorts resolves dependencies for you, it's entirely possible that if you get rtl-sdr from MacPorts you don't need any of the first five programs in that list. The consolidated commands are sketched after the footnotes below.

* For libusb I did sudo port install libusb +universal because of this thread.

** After you install MacPorts you can check port version, but you have to do it in a fresh terminal window; only then will the port command be on your path. Once it is, run sudo port selfupdate just to get into the habit. Three more things:
1. Every single thing you install with sudo port install will have to be launched from a fresh terminal window.
2. Do sudo port selfupdate often.
3. See also your other sudo port options: clean, uninstall, upgrade.
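Consolidated, the MacPorts route I ended up on looks like this, in a fresh terminal window after installing MacPorts:

sudo port selfupdate
sudo port install cmake autoconf automake libtool   # only needed if you still want to build from source
sudo port install libusb +universal
sudo port install rtl-sdr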

I think I'm on my way to putting this thing to use. See below:

$ rtl_test -t
Found 1 device(s):
  0:  Realtek, RTL2838UHIDIR, SN: 00000001

Using device 0: Generic RTL2832U OEM
Found Rafael Micro R820T tuner
Supported gain values (29): 0.0 0.9 1.4 2.7 3.7 7.7 8.7 12.5 14.4 15.7 16.6 19.7 20.7 22.9 25.4 28.0 29.7 32.8 33.8 36.4 37.2 38.6 40.2 42.1 43.4 43.9 44.5 48.0 49.6 
Sampling at 2048000 S/s.
No E4000 tuner found, aborting.

Alright, this "No E4000 tuner found, aborting." may look like cause for concern, but the -t test appears to apply only to E4000 tuners and this dongle has an R820T; either way, this note is about getting started.

R version of tsfill and xfill combined


Suppose you have an unbalanced panel with daily data observed at the zip code level, with some zip codes (labeled zip.cd) not having any records on some days (labeled cal.dt). Suppose also that these zip codes are clustered in DMAs (labeled dma.cd).

Suppose now that you want it balanced so all zip code + date combinations show up, with missing values filled in as needed. And you also want all zip code + date combinations to have a DMA code associated with them.

If this panel were a Stata data set, the xtset command would make Stata recognize it as a panel, and then tsfill, full would take care of the first half of the problem. This would leave you with some missing values for dma.cd, corresponding to zip code + date combinations not observed in the original data set. You would fill these in with xfill for the complete solution. This is both elegant and well-documented elsewhere.

If your panel is a data.table x (from the data.table package), the function below is one way to solve both halves of the problem in one call:

myXTFill <- function(x) {
  # x is a data.table with at least zip.cd, cal.dt, and dma.cd columns
  dt     <- copy(x)               # so setkeyv() below doesn't reorder the caller's copy
  xtkeys <- c('zip.cd','cal.dt')  # panel keys: unit and time
  clkeys <- c('zip.cd','dma.cd')  # cluster keys: each zip code belongs to one DMA
  # the balanced panel skeleton: every zip code + date combination
  xtfull <- CJ(unique(x[,zip.cd]),unique(x[,cal.dt]))
  # the zip-to-DMA lookup table
  clfull <- unique(subset(x,select=c(zip.cd,dma.cd)))
  setnames(xtfull,c('V1','V2'),xtkeys)
  setkeyv(xtfull,xtkeys)
  setkeyv(dt,xtkeys)
  setkeyv(clfull,clkeys)
  # balance the panel (tsfill, full), then drop the now-gappy dma.cd...
  xtfull <- subset(merge(xtfull,dt,all=TRUE),select=-dma.cd)
  # ...and fill it back in from the lookup table (xfill)
  xtfull <- merge(xtfull,clfull,all=TRUE)
  xtfull
}

Use it at your own risk.
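Here's a toy example, assuming a three-row unbalanced panel; the result has one row for every zip code + date combination, with any value columns left missing where there was no record and dma.cd filled in throughout:

library(data.table)
x <- data.table(zip.cd = c('27510','27510','27601'),
                cal.dt = as.Date(c('2015-01-01','2015-01-03','2015-01-02')),
                dma.cd = '560',
                sales  = c(10, 20, 30))
myXTFill(x)   # 6 rows: 2 zip codes x 3 dates, dma.cd filled in everywhere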


Backing up FreeNAS to AWS Glacier


On my FreeNAS home server I added a new jail and called it awsboss. I also added a new ZFS dataset and called it storage. I added this dataset to awsboss as /mnt/storage and also made it an AFP share, so it can be mounted by my Mac clients.

The storage dataset will receive, from the Mac clients in my house, various tarballs of files that I want to keep in deep storage on Glacier, such as old pictures. Then awsboss will upload them to Glacier from there.

The test file for this exercise is pics2011.tgz, a 2G tarball of everything that's in the ~/Pictures/iPhoto Library/Masters/2011 folder on my Mac. If this works, then at the end of every year I'll just make another tarball with all of the pictures and videos taken that year and send it along. These tarballs might as well go into a vault of their own, named something like iPhotoGabi, which I created at the AWS Console.

Items 1-4 below happen at the root prompt on awsboss:

  1. I installed and configured the AWS Command Line Interface (CLI) using the instructions for Mac/Linux. For this, I had to install pip first using the get-pip.py script. Configuration instructions for AWS CLI are here and they include a link to getting started with Identity and Access Management (IAM), which is where you set up the access key pairs that the CLI needs. If you follow the configuration instructions you're in effect writing two files: ~/.aws/credentials and ~/.aws/config.
  2. I installed boto, which also needs AWS credentials in the ~/.boto file, but these are laid out slightly differently from those stored in ~/.aws/credentials. The latter can include key pairs for multiple users, each pair under a header with the [userName] in brackets as shown. The ~/.boto file is user-specific, so it includes only one pair, under the header [Credentials]. So making ~/.boto a symlink to ~/.aws/credentials won't work. I tried. Both layouts are sketched after this list.

  3. As of this writing boto comes with a glacier command of its own, so now things should be easy. I expect that typing glacier upload iPhotoGabi /mnt/storage/pics2011.tgz will do the job.

  4. I was wrong. There was a bit of a problem. In response to the glacier upload command above I got a bunch of Python error references on screen, the last of which was boto.glacier.exceptions.UploadArchiveError: An error occurred while uploading an archive: 'ascii' codec can't decode byte 0x8b in position 1: ordinal not in range(128). This may look obscure, but it's something you can Google. I ended up on Stack Overflow and fixed two things: I created the file /usr/local/lib/python2.7/site-packages/sitecustomize.py to change my default encoding to Unicode as shown here; and then I fixed line 1276 of /usr/local/lib/python2.7/site-packages/boto/glacier/layer1.py as shown here.
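For reference, here is roughly what the three credential files from items 1 and 2 look like, with placeholder keys and an example region:

~/.aws/credentials:
[default]
aws_access_key_id = AKIAEXAMPLEKEY
aws_secret_access_key = EXAMPLESECRETKEY

~/.aws/config:
[default]
region = us-east-1

~/.boto:
[Credentials]
aws_access_key_id = AKIAEXAMPLEKEY
aws_secret_access_key = EXAMPLESECRETKEY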

There you have it. As of this writing, that 2G upload is in progress, off the awsboss jail, and my Mac client is free to do other things. God, I hope this works.

The next day

The upload completed and the iPhotoGabi vault now shows two items and a total size of 2.04GiB. One of the two items may be a small pdf I uploaded just for testing. Getting a listing of what exactly is in a vault is a two-step process: first you initiate an inventory retrieval job, then you wait. When the retrieval job completes you can collect the output. You know that the job has completed when the status in the response to

$ aws glacier --account-id='-' --vault-name='iPhotoGabi' list-jobs

changes from InProgress to Succeeded.

That response contains one line per job, and in each line there is a very long alphanumeric string which is the unique job ID. That's the one to pass in the command

$ aws glacier --account-id='-' --vault-name='iPhotoGabi' --job-id="longstringhere" get-job-output output.json

The file output.json in the working directory contains a JSON object that lists archive IDs and descriptions.

The directions for this are on Reddit.

Before this retrieval job completes there's not much I can do, but using the syntax in the examples on that same Reddit thread, I can at least describe my vault:

$ aws glacier --account-id='-' --vault-name='iPhotoGabi' describe-vault

All of this only worked after I re-ran aws configure and entered the access keys and region for the default user. Either the AWS CLI won't recognize specific IAM profiles, or they must be passed along explicitly with the --profile option. That's fine. The default user works for me.

Conclusion

This seems to work. I uploaded a 2G tarball and a small pdf using boto's glacier upload facility within a reasonable amount of time from a FreeNAS jail, and managed to retrieve them both using the AWS CLI.

More about FreeNAS jails


My first attempt to upload a large tarball to Glacier was a success, as described here, but subsequent ones failed. The reason, I suspect, is that the FreeNAS server is headless. I get shell access to its jails in a browser tab from one of the Mac clients. If I launch a slow upload and I close the browser or the client goes to sleep while the upload is in progress, the shell session terminates and the upload is aborted.

The solution seems to be to run these shell sessions in tmux. I have completed a second upload -- pics2012.tgz, weighing in at 1.4G -- in a tmux session that persisted through two browser shutdowns and one client sleep session, returning to work as expected with tmux attach every time.
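The tmux routine is short; the session name (glacier below) is just my label, so use whatever you like:

tmux new -s glacier       # start a named session and launch the upload inside it
# detach with Ctrl-b then d; losing the browser tab amounts to the same thing
tmux attach -t glacier    # later, from a fresh shell session, pick up where you left off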

For this, though, I had to install tmux on the awsboss jail first, and I ran into a problem: pkg install tmux would not work, and pkg install <ANYTHING> would not have worked either, for that matter. There are two possible fixes. One is to fix pkg install and then try pkg install tmux again. The other is to ignore the problem and try pkg_add tmux instead. This is the old-school way of adding packages and it is discouraged with the warning that it will "cause inconsistencies in the package management database." Whatever that means, it sounds serious, though a warning is not the same thing as an error, so I did it anyway.

While I was at it, I also ran pkg_add mosh, because MOSH seems to be another way to get state-preserving shell connections. This is of no use to me at the moment, but it will come up later. This post is motivated by my trying -- and failing -- to get pkg install to work properly on a FreeNAS 9.2.0 jail. I thought that would be easy, but I was wrong.

My googling turned up a fix that applies to FreeNAS 9.3, which allows two kinds of jails: port and standard. Port jails can run pkg install <ANYTHING>. Standard ones by design do not, but they can be persuaded as described here. I thought that the fix would apply to FreeNAS 9.2 jails as well, but the steps outlined in this discussion rely on some source files being present at /usr/src/share/keys so that you can run

cd /usr/src/share/keys && make && make install

There is no such directory path in a regular FreeNAS 9.2.0 jail, so this option, as far as I can tell, is out. But some pieces of the process did work, and may have been beneficial, as follows:

  1. # rm /usr/local/etc/pkg.conf
  2. # pkg2ng

The first step above gets rid of a useless pkg.conf whose single line trips up pkg install, so no loss there. The second is more interesting: it did something to both tmux and mosh. I don't know for sure, but I hope that whatever inconsistencies the FreeNAS documentation warns about when it discourages pkg_add get fixed if you run pkg2ng immediately after you pkg_add anything. The job of pkg2ng seems to be to convert packages installed with the old-school pkg_* tools to the newer pkgng format that in theory should have worked out of the box.

That's all I have: the hope that pkg_add won't screw up anything in practice, and that pkg2ng will fix what pkg_add could have screwed up in theory.




