Not impressed with Lion
Been using OS X Lion at home for a couple of weeks and I am not impressed. Tiger (which I use for work) feels clean and light and gets out of your way, almost like a waiter who keeps your glass full without you ever noticing.
Lion, on the other hand, feels like a needy, high-maintenance girlfriend that tries to grab your attention all the time. Really, do I need my emails thrown on a pile? Why do I need a grey background for my email to show a thread (good thing I have a 17 inch screen)? And why does my calendar have to look like an old leather binder, with a look of torn pages on top? Why does my $2000 Mac have to be reduced to looking like this? And the pop-up dialog boxes feel like they are going to fly off the screen and hit me in the face. It all feels more clunky and resource intensive. And come on, why does my terminal have a busy icon on the top? What the heck is it doing? I am not even typing anything. By the way, Apple, there is something called, oh I don’t know, ‘history’ in the shell that tells me what I did before. I don’t need you to remember it and show it to me every time I reopen a terminal.
You know what this reminds me of? The transition from Win ’98 to XP, oh snap!
Maven Integration Tests
Ever forget to add @Ignore to your integration test and have the rest of the team complain? Or create a different project just to hold the integration tests? Well, no more.
With the Maven Failsafe plugin, you no longer need to ignore your integration tests. By default the plugin picks up test classes named IT*.java, *IT.java or *ITCase.java and runs them for you.
Continue to run your regular tests with 'mvn clean install'. If you want to run your integration tests, run 'mvn failsafe:integration-test failsafe:verify'.
Don’t forget to remove the @Ignore and rename your tests to *IT.java.
Stop Words and commonly concatenated words
Here is a list of about 800 stop words built from 4 million documents (I started with this set). This list has helped us reduce model size and increase accuracy. Please note that the same list may not be applicable to your application, so review it before using it.
More interestingly, here is a list of commonly concatenated words that I found, along with their corrections.
December is a great time to work
Co-workers are playing with remote-controlled helicopters, exchanging recipes for goodies, cake in the kitchen, low traffic going to work, lots of laughs in the office.
How I wish it stayed like this for the rest of the year.
Caching method calls or Memoization
Just as we use the Hibernate 2nd level cache to store data, we can also save the results of a method call. This is a pretty old technique; functional programming languages like Haskell make heavy use of it and call it by a fancy name, memoization: http://en.wikipedia.org/wiki/Memoization
Here is how it is done in Spring: http://springtips.blogspot.com/2007/06/caching-methods-result-using-spring-and_23.html
You can read the details in the link, but at a high level, when you call a method the result is stored in the cache, and the next time around the result from the cache is used. As usual, you can declare how long the cache should stay active and so on in a simple ehcache.xml config file.
There is also an open source project that now lets you just decorate methods with the @Cacheable annotation and it takes care of the rest: http://code.google.com/p/ehcache-spring-annotations/wiki/UsingCacheable
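To make the idea concrete without any Spring or Ehcache wiring, here is a minimal sketch in plain Java. The class name MemoizeSketch and the memoize helper are made up for illustration; this is not how the Spring or ehcache-spring-annotations projects implement it, and there is no expiry or eviction here, which is the part ehcache.xml would normally handle.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Function;

public class MemoizeSketch {

    /**
     * Wraps fn so that each distinct argument is computed once and later
     * calls are answered from an in-memory map.
     * Assumptions: fn is not recursive through the wrapper, arguments have
     * sensible equals/hashCode, and null results are not cached.
     */
    public static <A, R> Function<A, R> memoize(Function<A, R> fn) {
        Map<A, R> cache = new ConcurrentHashMap<>();
        return arg -> cache.computeIfAbsent(arg, fn);
    }

    public static void main(String[] args) {
        AtomicInteger calls = new AtomicInteger();

        // Stand-in for an expensive method; counts how often it really runs.
        Function<Integer, Integer> slowSquare = n -> {
            calls.incrementAndGet();
            return n * n;
        };

        Function<Integer, Integer> cached = memoize(slowSquare);
        System.out.println(cached.apply(12));  // computed
        System.out.println(cached.apply(12));  // served from the cache
        System.out.println("underlying calls: " + calls.get());
    }
}
```

The second call with the same argument never reaches slowSquare, which is the whole trick; the annotation-based approaches above do the same thing around a method call, with a real cache behind it.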
# of lines of code in your project
As I wait for the build I wrote up this post. It has absolutely no point, just an observation. At present the code base that I work with every day has:
1030467 lines of Java
641411 lines of Xml
224530 lines of Jsp
58950 lines of plain text
102751 lines in property files
2246 lines of groovy
1353693 lines of SQL (schema files, dml, ddl….)
90 Projects
7186 Java files
2547 SQL files
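How many does yours have?
It’s easy, run this: find . -name '*.java' | xargs wc -l | grep total | sed 's/total//g' (with a very large number of files xargs may run wc more than once and print several totals, in which case add them up).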
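If you would rather not depend on find and wc, the same count can be sketched in Java. The class name LineCount and its methods are hypothetical, and the numbers can differ slightly from wc -l, which counts newline characters, so a last line without a trailing newline is counted here but not by wc.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.stream.Stream;

public class LineCount {

    /** Sums the line counts of all regular files under root whose name ends with extension. */
    public static long countLines(Path root, String extension) throws IOException {
        try (Stream<Path> paths = Files.walk(root)) {
            return paths.filter(Files::isRegularFile)
                        .filter(p -> p.getFileName().toString().endsWith(extension))
                        .mapToLong(LineCount::linesIn)
                        .sum();
        }
    }

    /** Counts lines in one file; ISO-8859-1 is used so odd bytes never fail decoding. */
    static long linesIn(Path file) {
        try (Stream<String> lines = Files.lines(file, StandardCharsets.ISO_8859_1)) {
            return lines.count();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) throws IOException {
        // Usage: java LineCount [root-directory] [extension]
        Path root = Paths.get(args.length > 0 ? args[0] : ".");
        String extension = args.length > 1 ? args[1] : ".java";
        System.out.println(countLines(root, extension) + " lines of " + extension);
    }
}
```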
Favorite languages, why so great? and why not so much?
I actually have 2 favorite languages:
- Ruby: for all scripting and making quick apps.
- Clojure: for development.
Why Ruby is great
- The language was designed for programmers to use; you can see that from the API, which is totally intuitive.
- Lots of libraries. My favorite is Sinatra, which lets you build quick and dirty web apps, and the other is Sequel.
- I wrote a blog post on how to delete RFC-822 incompatible emails (if you are a developer using Linux and your company uses Outlook you know what I am talking about). This is a simple example of how I have used Ruby to make quick and dirty scripts.
I have used Ruby numerous times to write scripts to fix production data, correct files, and generate complex reports. I have used Sinatra with Google Charts to make web apps that can show load times, server status, and so on.
Why Ruby is not so great
- Not really meant for performance. In recent years there has been a push to develop a virtual machine for Ruby, but it is still not anywhere close to C/Java performance.
- Rails is a pain to deploy. Heroku takes away the pain, but what do you do if you have to deploy internally? I personally have 2 apps on Heroku, one of which is http://first3links.com/
Why Clojure is great
I have been on a quest to learn a functional programming language for the past 3 years. I have read the Erlang book (please see the various posts I wrote about Erlang here). Erlang is a fine language, but I lost interest in it after I could not find a single good library that can connect Erlang to Oracle. The problem: there are too few 3rd party libraries. The next language I looked at was Haskell. It has lots of libraries and seems to perform well on the surface; the problem I see is acceptance by business, where most of the code is in Java. Then I found Clojure and fell in love with it.
- It is just another DSL for the JVM. If you provide type hints, the code generated will be the same as what Java would generate (so you can easily sneak it in).
- Totally embraces the JVM unlike JRuby.
- The author Rich Hickey has done a lot to reduce the pain points of lisp.
- Finally a language that frees your mind of OOP. (Have you ever noticed how much time you spend trying to achieve the best object model when a simple one would do? And for what? The customers don’t care as long as it works, and the computers sure don’t care as long as it is 0s and 1s.)
- Code is so concise and elegant.
Why Clojure is not great.
- It has been called the language with the steepest learning curve on the JVM, and I tend to agree.
- Unlike Scala, you have no wiggle room: it is either functional code or nothing (I like this feature, actually).
- Debugging is a major pain point (though there has been improvement with the latest swank-clojure).
I have written many posts on Clojure on my blog; you can see them here. In the most recent post I show how one can parse a one million record file in less than 15 seconds with Clojure.
Denormalizing One million records with Clojure.
MovieLens is a research project that provides datasets of various sizes and attributes, containing movie ratings. These datasets are free to download and use for non-commercial purposes. They have done an awesome job putting this data together and a big thanks goes to them for making it available.
I wanted to exercise my Clojure skills (more like add to my tiny set of Clojure skills) and it just so happens that I recently came across the MovieLens project, so how about analyzing that data using Clojure?
One of the datasets they make available is the One Million Dataset. This set consists of 3 files:
- “movies.dat” containing 3883 movie listings, contains title, genre…
- “users.dat” containing 6040 unique users, contains age, occupation, gender …
- “ratings.dat” containing 1000209 movie ratings, which reference the movie id and user id from the above 2 files.
I could analyze this data to answer questions such as: what age group gave the most ratings? Or what was the highest rated movie for a given time period?
But before I could do this I wanted to denormalize the ratings file so that it also contains the user and movie information. Why? Because I don’t want to look it up when I am analyzing the data; each record should be self-contained.
The outline of the program is quite simple:
- Read the users file into memory
- Read the movies file into memory
- For each line in the ratings file
  - Find the corresponding movie and user
  - Print it out to a file
Take a minute to think about how you would do this in Java and then look at the code below. I ran it on a dual 2.2GHz Dell laptop with 4 gig of RAM. Care to guess how long it takes? Scroll down for the answer.
(ns com.dev.file-reader
  (:use [clojure.contrib.duck-streams])
  (:import [java.io BufferedReader FileReader BufferedWriter FileWriter]))

(defstruct user :id :gender :age :occupation :zip-code)
(defstruct movie :id :title :genres)

;; Render a user or movie struct back into the "::"-separated format.
(defn format-user [user]
  (str (:id user) "::" (:gender user) "::" (:age user) "::" (:occupation user) "::" (:zip-code user)))

(defn format-movie [movie]
  (str (:id movie) "::" (:title movie) "::" (:genres movie)))

;; Build a map of user id (string) -> user struct.
(defn read-user-file [fileName]
  (loop [users {} fileSeq (read-lines fileName)]
    (let [line (first fileSeq)]
      (if (nil? line)
        users
        (let [tokens (.split line "::")
              id (aget tokens 0)
              user (struct user id (aget tokens 1) (aget tokens 2) (aget tokens 3) (aget tokens 4))]
          (recur (assoc users id user) (rest fileSeq)))))))

;; Build a map of movie id (string) -> movie struct.
(defn read-movies-file [fileName]
  (loop [movies {} fileSeq (read-lines fileName)]
    (let [line (first fileSeq)]
      (if (nil? line)
        movies
        (let [tokens (.split line "::")
              id (aget tokens 0)
              movie (struct movie (Integer/parseInt (aget tokens 0)) (aget tokens 1) (aget tokens 2))]
          (recur (assoc movies id movie) (rest fileSeq)))))))

(defn convert-ratings-file
  "read the ratings file and denormalize it"
  [moviesF usersF ratingsF outputF]
  (let [movies (read-movies-file moviesF) users (read-user-file usersF)]
    ;; 1 MB buffers on both the reader and the writer
    (with-open [#^BufferedReader rdr (BufferedReader. (FileReader. ratingsF) 1048576)
                #^BufferedWriter wtr (BufferedWriter. (FileWriter. outputF) 1048576)]
      (doseq [line (line-seq rdr)]
        (let [tokens (.split line "::")
              user-id (aget tokens 0)
              movie-id (aget tokens 1)
              user (get users user-id)
              movie (get movies movie-id)
              rating (aget tokens 2)
              timestamp (aget tokens 3)]
          (.write wtr (str (format-user user) "::" (format-movie movie) "::" rating "::" timestamp "\n")))))))

(defn doIt []
  (time (convert-ratings-file
          "movielens-1m/movies.dat"
          "movielens-1m/users.dat"
          "movielens-1m/ratings.dat"
          "movielens-1m/output.dat")))
So, ready with your guess?
I ran the program 5 times and here is the output
"Elapsed time: 12130.035819 msecs"
"Elapsed time: 13113.92823 msecs"
"Elapsed time: 13364.234216 msecs"
"Elapsed time: 12553.478168 msecs"
"Elapsed time: 14488.706176 msecs"
On average 13.13 seconds (13.130076521799994, to be exact) to read in 1 million records, look up the movie and user for each record, and write it back to the disk.
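For anyone who did take that minute to think about the Java version, here is one way the same outline could look. This is only a sketch for comparison and I have not timed it against the numbers above. The class name DenormalizeRatings is made up, and unlike the Clojure code it keeps each user and movie line whole instead of splitting it into fields, on the assumption that every line starts with its id followed by "::". ISO-8859-1 is also an assumption, chosen so that accented titles never fail decoding.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.HashMap;
import java.util.Map;

public class DenormalizeRatings {

    /** Maps the first "::"-separated field (the id) to the whole line. */
    static Map<String, String> readById(Path file) throws IOException {
        Map<String, String> byId = new HashMap<>();
        try (BufferedReader in = Files.newBufferedReader(file, StandardCharsets.ISO_8859_1)) {
            String line;
            while ((line = in.readLine()) != null) {
                int sep = line.indexOf("::");
                if (sep < 0) continue;  // skip blank or malformed lines
                byId.put(line.substring(0, sep), line);
            }
        }
        return byId;
    }

    /** Writes user::movie::rating::timestamp for every line of the ratings file. */
    public static void convert(Path movies, Path users, Path ratings, Path output) throws IOException {
        Map<String, String> movieById = readById(movies);
        Map<String, String> userById = readById(users);
        try (BufferedReader in = Files.newBufferedReader(ratings, StandardCharsets.ISO_8859_1);
             BufferedWriter out = Files.newBufferedWriter(output, StandardCharsets.ISO_8859_1)) {
            String line;
            while ((line = in.readLine()) != null) {
                String[] t = line.split("::");  // user id, movie id, rating, timestamp
                // An unknown id prints as "null" here; no attempt is made to handle it.
                out.write(userById.get(t[0]) + "::" + movieById.get(t[1]) + "::" + t[2] + "::" + t[3]);
                out.write('\n');
            }
        }
    }

    public static void main(String[] args) throws IOException {
        Path dir = Paths.get("movielens-1m");
        long start = System.nanoTime();
        convert(dir.resolve("movies.dat"), dir.resolve("users.dat"),
                dir.resolve("ratings.dat"), dir.resolve("output.dat"));
        System.out.println("Elapsed ms: " + (System.nanoTime() - start) / 1_000_000);
    }
}
```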
Clojure puts the FUNctional back in programming.
