
Cassandra and Big Data – building a single-node “cluster”

Cassandra – Getting off the ground.
Continuation of post: Apache Cassandra Project – processing “Big Data”

While researching a project on Big Data services, I knew that I’d need a multi-node cluster to experiment with, but a pile of hardware was not immediately available.

Using the VERY helpful book Cassandra High Performance Cookbook, I was able to build a 3-node cluster on a single machine. This is how I did it:


For this test cluster, I am using Ubuntu 10 with the following JVM:

      JVM vendor/version: OpenJDK 64-Bit Server VM/1.6.0_22

Downloaded Cassandra 1.0.8 package from here:
http://apache.mirrors.tds.net//cassandra/1.0.8/apache-cassandra-1.0.8-bin.tar.gz

Created new user on system: bigdata

Created the required base data directories in the new user's home directory:

  $ mkdir commitlog log data saved_caches

Copied the package into the bigdata home directory and started the build:

$ cp /tmp/apache-cassandra-1.0.8-bin.tar.gz .

Unzipped and extracted the contents

$ gunzip apache-cassandra-1.0.8-bin.tar.gz
$ tar xvf apache-cassandra-1.0.8-bin.tar

Renamed the extracted directory to the first instance name, cassA-1.0.8:

$ mv apache-cassandra-1.0.8 cassA-1.0.8

Extracted the archive twice more, renaming each copy for the other two planned instances:

$ tar xfv apache-cassandra-1.0.8-bin.tar
$ mv apache-cassandra-1.0.8 cassB-1.0.8  

$ tar xfv apache-cassandra-1.0.8-bin.tar
$ mv apache-cassandra-1.0.8 cassC-1.0.8  
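
The extract-and-rename steps above can also be written as one loop. This is only a sketch: unpack_instances is a hypothetical helper name, and it assumes you start from the original .tar.gz (tar's -z flag reads gzip archives directly, so the separate gunzip step is skipped).

```shell
#!/usr/bin/env bash
# Sketch: unpack one Cassandra tarball into several instance directories.
# unpack_instances is a hypothetical helper, not something shipped with Cassandra.
#
# usage: unpack_instances <tarball.tar.gz> <version> <name>...
unpack_instances() {
    local tarball=$1 version=$2 name
    shift 2
    for name in "$@"; do
        # Refuse to clobber an instance directory that already exists.
        if [ -e "$name-$version" ]; then
            echo "$name-$version already exists" >&2
            return 1
        fi
        # -z makes tar decompress the gzip archive while extracting.
        tar xzf "$tarball" || return 1
        # The archive always unpacks to apache-cassandra-<version>; rename it.
        mv "apache-cassandra-$version" "$name-$version" || return 1
    done
}
```

For the layout in this post, the call would be: unpack_instances apache-cassandra-1.0.8-bin.tar.gz 1.0.8 cassA cassB cassC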

This gave me three instances to configure, each with its own IP address:

  cassA-1.0.8   10.1.100.101
  cassB-1.0.8   10.1.100.102
  cassC-1.0.8   10.1.100.103

Edited the configuration files in each instance (cassA-1.0.8 used as the example):

$ vi cassA-1.0.8/conf/cassandra.yaml 

[...]

# directories where Cassandra should store data on disk.
data_file_directories: 
    - /home/bigdata/data/cassA

# commit log
commitlog_directory: /home/bigdata/commitlog/cassA

# saved caches
saved_caches_directory: /home/bigdata/saved_caches/cassA

[...]

# If blank, Cassandra will request a token bisecting the range of
# the heaviest-loaded existing node.  If there is no load information
# available, such as is the case with a new cluster, it will pick
# a random token, which will lead to hot spots.
initial_token: 0

[...]

# Setting this to 0.0.0.0 is always wrong.
listen_address: 10.1.100.101

[...]

rpc_address: 10.1.100.101

[...]

          # seeds is actually a comma-delimited list of addresses.
          # Ex: "<ip1>,<ip2>,<ip3>"
          - seeds: "10.1.100.101,10.1.100.102,10.1.100.103"
[...]

Giving each instance its own log file is recommended. Edit the log4j config to set it:

vi cassA-1.0.8/conf/log4j-server.properties

[...]
log4j.appender.R.File=/home/bigdata/log/cassA.log
[...]

Repeat for instances cassB and cassC, changing the directories, addresses and log file to match each instance, and setting the token value for B and C to appropriate values (see Extra Credit below if you need to know how to do *that* part):

#cassB
initial_token: 56713727820156410577229101238628035242

#cassC
initial_token: 113427455640312821154458202477256070485
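
If you would rather not hand-edit three copies of cassandra.yaml, the single-line settings can be stamped in with sed. This is a rough sketch, not how I did it at the time: set_yaml_key and set_node_identity are hypothetical helper names, and it only touches top-level "key: value" lines (listen_address, rpc_address, initial_token). The directory entries and the seeds list still need editing by hand.

```shell
#!/usr/bin/env bash
# Sketch: write per-instance values into cassandra.yaml.
# set_yaml_key and set_node_identity are hypothetical helpers.

# Replace the value of a top-level "key: value" line (key must start in column 1).
set_yaml_key() {
    local file=$1 key=$2 value=$3 tmp status
    tmp=$(mktemp) || return 1
    # Write to a temp file first, then copy back so the original keeps its permissions.
    sed "s|^$key:.*|$key: $value|" "$file" > "$tmp" && cat "$tmp" > "$file"
    status=$?
    rm -f "$tmp"
    return $status
}

# usage: set_node_identity <cassandra.yaml> <ip address> <initial token>
set_node_identity() {
    local file=$1 address=$2 token=$3
    set_yaml_key "$file" listen_address "$address" &&
    set_yaml_key "$file" rpc_address "$address" &&
    set_yaml_key "$file" initial_token "$token"
}
```

Call it once per instance, passing that instance's conf/cassandra.yaml, its IP address and its token.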

To enable the JMX management console, each instance will require its own port. Edit the env file to set that up.

vi cassA-1.0.8/conf/cassandra-env.sh

[...]
# Specifies the default port over which Cassandra will be available for
# JMX connections.
JMX_PORT="8001"
[...]

Repeated this for the other two instances, using ports 8002 and 8003 respectively.
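
The same port assignment can be scripted. A minimal sketch, assuming each cassandra-env.sh still contains a line that starts with JMX_PORT= (as in the excerpt above); assign_jmx_ports is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Sketch: give each instance its own JMX port, counting up from a first port.
# assign_jmx_ports is a hypothetical helper, not part of Cassandra.
#
# usage: assign_jmx_ports <first port> <instance dir>...
assign_jmx_ports() {
    local port=$1 dir file tmp
    shift
    for dir in "$@"; do
        file="$dir/conf/cassandra-env.sh"
        tmp=$(mktemp) || return 1
        # Rewrite only the JMX_PORT assignment; every other line passes through.
        if sed "s/^JMX_PORT=.*/JMX_PORT=\"$port\"/" "$file" > "$tmp"; then
            # Copy back with cat so the script keeps its original permissions.
            cat "$tmp" > "$file"
        else
            rm -f "$tmp"
            return 1
        fi
        rm -f "$tmp"
        port=$((port + 1))
    done
}
```

For this cluster: assign_jmx_ports 8001 cassA-1.0.8 cassB-1.0.8 cassC-1.0.8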

Now, for the final trick, start up the instances:

  cassA-1.0.8/bin/cassandra
  cassB-1.0.8/bin/cassandra
  cassC-1.0.8/bin/cassandra

The cluster instances started up, and they can be seen active in the process table here:

$ ps -lf
F S UID        PID  PPID  C PRI  NI ADDR SZ WCHAN  STIME TTY          TIME CMD
0 S bigdata   4554     1  2  80   0 - 226846 futex_ 12:13 pts/0   00:00:05 java -ea -XX:+UseThreadPriorities -XX:ThreadPriorityPolicy=
0 S bigdata   4593     1  2  80   0 - 210824 futex_ 12:13 pts/0   00:00:05 java -ea -XX:+UseThreadPriorities -XX:ThreadPriorityPolicy=
0 S bigdata   4632     1  2  80   0 - 226830 futex_ 12:13 pts/0   00:00:05 java -ea -XX:+UseThreadPriorities -XX:ThreadPriorityPolicy=
0 R bigdata   5047  3054  0  80   0 -  5483 -      12:16 pts/0    00:00:00 ps -lf

Finally, to check the status, connect to one of the nodes’ JMX ports and check the ring. You only need to connect to one of the cluster’s nodes to check the complete cluster’s status:

$ bin/nodetool -h 10.1.100.101 -port 8001 ring
Address         DC          Rack        Status State   Load            Owns    Token                                       
                                                                               113427455640312821154458202477256070485     
10.1.100.101    datacenter1 rack1       Up     Normal  21.86 KB        33.33%  0                                           
10.1.100.102    datacenter1 rack1       Up     Normal  20.28 KB        33.33%  56713727820156410577229101238628035242      
10.1.100.103    datacenter1 rack1       Up     Normal  29.1 KB         33.33%  113427455640312821154458202477256070485      
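
If you want a script to make the same check, the ring output can be piped through a small filter. This is a sketch that assumes the column layout shown above (address, DC, rack, status, state, ...), which may differ in other Cassandra versions; ring_all_up is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Sketch: read `nodetool ring` output on stdin and succeed only if at least
# one node is listed and every node reports Up / Normal.
# ring_all_up is a hypothetical helper; it assumes the 1.0.x column layout.
ring_all_up() {
    local addr dc rack status state rest nodes=0
    while read -r addr dc rack status state rest || [[ -n $addr ]]; do
        # Node lines start with an IPv4 address; this skips the header
        # and the wrap-around token line.
        [[ $addr =~ ^[0-9]+\.[0-9]+\.[0-9]+\.[0-9]+$ ]] || continue
        nodes=$((nodes + 1))
        if [[ $status != Up || $state != Normal ]]; then
            echo "node $addr is $status/$state" >&2
            return 1
        fi
    done
    # An empty or unparseable ring listing counts as a failure.
    (( nodes > 0 ))
}
```

Usage: bin/nodetool -h 10.1.100.101 -port 8001 ring | ring_all_up && echo "ring healthy"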

Now, that’s a functional 3-instance cluster running on a single node. These instances are not in separate VMs. If you wanted to experiment with a larger cluster by running multiple instances on multiple VMs on a single hypervisor, I don’t really see why you couldn’t!

In the next article, I’m going to start feeding data into the cluster. Stay tuned for that!


Extra Credit:

To create the token values I needed for this three-node ring, I used the following Perl script. BTW, bignum is required unless you want Perl printing these big numbers in scientific notation:

#!/usr/bin/perl
use bignum;
my $nodes = shift;
print "Calculate tokens for $nodes nodes\n";
print "node 0\ttoken: 0\n" unless $nodes;
exit unless $nodes;
my $factor = 2**127;
print "factor = $factor\n";
for (my $i=0;$i<$nodes;$i++) {
	my $token = $i * ( $factor / $nodes);
	print "node $i\ttoken: $token\n";
}

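Running the script for three nodes gave me the following results (initial_token must be a whole number, so I dropped the fractional part when setting the values in cassandra.yaml):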

$ ./maketokens.pl  3

Calculate tokens for 3 nodes
factor = 170141183460469231731687303715884105728
node 0	token: 0
node 1	token: 56713727820156410577229101238628035242.67
node 2	token: 113427455640312821154458202477256070485.34
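
If Perl is not handy, the same tokens can be computed as whole numbers directly in bash. This is a sketch rather than the script I actually used: bash integers are only 64 bits wide, so it does schoolbook multiplication and division on decimal strings, computing floor(i * 2**127 / n) for each node. The helper names (mul_small, div_small, tokens) are made up for this example.

```shell
#!/usr/bin/env bash
# Sketch: integer initial_token values, floor(i * 2^127 / n) for i = 0..n-1.
# 2^127 overflows bash arithmetic, so work digit by digit on decimal strings.
# mul_small, div_small and tokens are hypothetical helper names.

RING_SIZE=170141183460469231731687303715884105728   # 2^127

# Multiply decimal string $1 by a small non-negative integer $2.
mul_small() {
    local num=$1 k=$2 carry=0 out="" i prod
    for (( i = ${#num} - 1; i >= 0; i-- )); do
        prod=$(( ${num:i:1} * k + carry ))
        out="$(( prod % 10 ))$out"
        carry=$(( prod / 10 ))
    done
    (( carry > 0 )) && out="$carry$out"
    echo "$out"
}

# Divide decimal string $1 by a small positive integer $2, dropping the remainder.
div_small() {
    local num=$1 d=$2 rem=0 out="" i cur
    for (( i = 0; i < ${#num}; i++ )); do
        cur=$(( rem * 10 + ${num:i:1} ))
        out="$out$(( cur / d ))"
        rem=$(( cur % d ))
    done
    # Strip leading zeros but keep a single "0".
    while [[ ${#out} -gt 1 && ${out:0:1} == 0 ]]; do out=${out:1}; done
    echo "$out"
}

# Print one evenly spaced token per node for an n-node ring.
tokens() {
    local n=$1 i
    for (( i = 0; i < n; i++ )); do
        echo "node $i token: $(div_small "$(mul_small "$RING_SIZE" "$i")" "$n")"
    done
}
```

Running tokens 3 prints 0 and the two whole-number values used for cassB and cassC earlier in this post. Multiplying before dividing matters here: dividing first and then truncating would leave the last token one lower.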

Additional Comments:

If you are setting up a standard multi-box cluster, make sure you have the following ports opened up on any firewalls. If not, the cluster members won’t find each other:

# TCP port, for commands and data
storage_port: 7000

# SSL port, for encrypted communication.  Unused unless enabled in
# encryption_options
ssl_storage_port: 7001

NEXT: Setting up a Java build env to prepare for Cassandra development

Current Reading List – Feb 2012

It has been a good many years since I have posted about my current reading list, so I thought it was about time to fire off another one. These are the books started, completed, being read or on my short-list to start (or in one case re-read) in the month of February.


The 4-Hour Workweek (completed)

I found this book amazingly insightful. Regardless of how much you implement in your own career, it’s a fantastic tome. Those I’ve gifted the book to have all said they really found it useful, interesting and a true paradigm shift in how they view life, career, family and finding a new balance between them that suits you!

Tim lays out his own struggles with great candor, including his failures in life and time management and how he found a way to overcome all of them. The book is also filled with testimonials from readers of his first edition. If you’re finding that you want more out of your life, struggling with the concept of retirement and wondering what you’ll do when you retire, this book may upset your world, but hopefully in doing so it will show you some options you might not have considered. Give this a read!


Design of Design (finishing up)
Over the many years in the role of software designer (originally trained in the 80’s, which is my biggest challenge to overcome), there has always been a nagging sense that some part of the process was not working for me. I adjusted, tried other methods, made adaptations, but the old Rational Model (aka Waterfall) of design always seemed to fail me. Now, I understand why! It’s a BAD MODEL. Dr. Frederick Brooks, father of the IBM 360 and now a professor at the University of North Carolina at Chapel Hill, rips open the old concepts in this book of his essays on design.

Covering a variety of other design methodologies, this book is not only a theoretical read, but an empirical one. Many real-world examples of design program successes and failures are laid out, almost in a case study format. This has been a very educational read. Lessons learned from this book have been put into place in current projects, and the results are already starting to be seen. Now I just need to start educating my staff and colleagues on these findings. Recommended.


The Creative Priority
I originally read this book in the late 90’s and found it very useful in understanding the creative process: what drives creatives and how to foster a creative culture. Sadly, over the years of the Dot-Bomb meat-grinder cultural immersion, these concepts and skills have been lost. So, I’m pulling this one back off the shelf for a re-read. I plan to report on it soon.


Cassandra High Performance Cookbook
This is the latest addition to the list, having just arrived this weekend. I’m currently running a project to investigate the suitability of Cassandra to solve problems in a client’s current relational database solution (see my previous post about Cassandra for background).

This book was recommended by the primary authors and maintainers of Cassandra. I look forward to cracking this open and going head-first into this technology.

Stanford Entrepreneurial Thought Leaders series – Jessica Mah

Aired 30-November-2011
Interview with inDinero’s Co-Founder and Architect Jessica Mah. Jessica discusses business accelerators, Angel investors, common start-up mistakes and why she feels it’s not good to “Fake it until you make it.”

inDinero was founded by two people. Jessica said that having a co-founder was really important: a trusted person who can confirm or refute ideas is very useful. She also advised thinking up front about what your company culture should be. It’s OK to have a fun working environment (she likened inDinero to a club), but it still needs to be run as a business. A party culture is not likely to succeed. One thing they do to maintain a cohesive workforce and to exchange ideas is to get out of the office once per week and all have dinner together.

Jessica related that they got a great jump start on the company by applying to and becoming part of Y-Combinator, which provides a small amount of funding for start-ups. She had very high praise for Y-Combinator, and said she sometimes wishes they were still in that environment. At the end of Y-Combinator they had the opportunity to pitch their company to a variety of potential investors, and as a result of that they raised over $1,000,000 in start-up funding.

inDinero decided to go strictly with Angel investors, and not Venture Capitalists. Going with the Angel investors had a number of advantages. Along with having more direct control over the organization, as opposed to going with VC’s, she felt they had a larger group of advisers with which to consult. That said, she also cautioned against raising money too early.

But even with a million-plus dollars in funding, things were still not easy. The money added significant stress and pressure to perform. She wished that they had taken as much funding as they possibly could up front. They had underestimated the costs of mistakes, such as leasing a fancy office or making bad hiring decisions.

One of the toughest lessons they learned was that money goes fast. Keeping control of costs, despite what seemed to be ample funding, is critical to remaining both operational and emotionally strong. One costly mistake they made was leasing a fancy office. It presented a nice face for visiting clients, but internally Jessica said she felt more like a con artist, knowing the reality of their success. They have disposed of the fancy office (that included a hot tub), and are now operating out of an apartment. She felt that it made them feel more scrappy, as opposed to content.

Staffing the company also turned out to be more challenging than she anticipated. The practice of hiring on intelligence alone turned out to be a mistake. Employees need to be well-rounded, able to work well with others, and also with customers. They found that the best way to get the A-Players that every start-up needs is to give them a test drive first. It’s the fastest path to finding the gold. You only want to keep A-Players. Having under-performers in a seat is far more costly than having that seat empty (this is a painful lesson I’ve learned as well, and one I’m still trying to address). Bottom line, it’s better to leave that seat empty.

For a technology company that delivers its product over the web, product design and testing are a critical part of remaining useful and relevant. Usability testing was the smartest thing they did. Usability testing has to be done IN PERSON. It’s the only way to do this (Steve Krug, a well-known author on UI design, covers this in his book Rocket Surgery Made Easy). They met with dozens and dozens of customers to find out what the product really should be. It’s not unexpected that the product will need to adapt to the reality of customer desires. However, that does not mean that the product needs to change to meet every customer’s every need. For example, some requested features would have turned their product into a full-blown accounting system. They had to point out to those customers that they had signed up with inDinero because they wanted something simple, and that inDinero was not planning to move in that direction.

Several lessons were learned along the way. Customers do not care about elegant or perfect code. The product evolution was iterative. One simply can’t figure out everything customers want up front, and simply pumping the product up with tons of features is not a good road map to success.

Long-term product release planning was also a hindrance. They found that planning product releases 3-6 months out was far too long. Customers couldn’t really wait that long for the features they needed, so they shifted to a 2-3 week product plan cycle. In contrast to that lesson, though, was the imperative to NOT release too early! She said that start-ups must resist pressure from investors to release a product too early. inDinero bowed to that pressure and initially released the product too early, squandering a lot of useful PR on an immature product.

One tool they used to gain access to customers who would provide useful feedback on the product was placement of a “Would you recommend” link. That link asked them if they would recommend the product, if they felt the product had promise, or if they simply didn’t like the product. Jessica focused on those who thought the product had promise, but would not recommend it. She found them to have the most useful feedback. Those who indicated that they simply didn’t like the product were customers who didn’t really need the product anyway.

Finally, there were a few other important observations and tips.

  • Don’t get yourself hung up on vanity metrics, like followers, hits, etc. Stick to the metrics that matter like adoption and revenue.
  • College didn’t teach them how to build real world useful products.
  • Business plans seldom survive first contact with customers.
  • Recommended book: The Four Steps to the Epiphany

  • When you start hating your customers, you are OUT OF BUSINESS.