Re-Configuring an Empty Cassandra Cluster

PREV: Setting up a Java build env to prepare for Cassandra development

After doing more research, I decided that ordered partitioning was not going to buy me anything but a lop-sided distribution. Look at the example below (it uses IP keys rather than the hostnames originally envisioned; hostnames will be a later evaluation).

With real-world data, I'd have three very heavy nodes and three very light ones:

Node:   Range:                               Dist:
======  ===================================  =====
node00          0.0.0.0 to 42.170.170.171      6%
node01   42.170.170.172 to 85.85.85.87        32%
node02      85.85.85.88 to 128.0.0.3          34%
node03        128.0.0.4 to 170.170.170.175     2%
node04  170.170.170.176 to 213.85.85.91       21%
node05     213.85.85.92 to 255.255.255.255     3%
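
For the curious, this sort of distribution check is easy to approximate. Here's a rough sketch, assuming a file ips.txt with one dotted-quad address per line; it buckets each IP into one of six equal slices of the 32-bit space, mirroring the ranges above:

# count how many sample IPs land in each of six equal ranges (buckets 0-5)
awk -F'[.]' '{ ip = $1*2^24 + $2*2^16 + $3*2^8 + $4;
               print int(ip / (2^32 / 6)) }' ips.txt | sort -n | uniq -c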

Goofing around with pseudo-random key naming to get a better balance only does one thing: it makes the keys I wanted to use (IPs) basically worthless, since the ordering is wrecked regardless. Random partitioning is Cassandra's default configuration, so that's what I plan to use. The problem is that I'd built out this specific node set with this setting first:

ByteOrderedPartitioner orders rows lexically by key bytes. BOP allows scanning rows in key order, but the ordering can generate hot spots for sequential insertion workloads.
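
One aside on "lexically by key bytes": if you store IPs as strings, byte order is not numeric order, so even the scan ordering BOP preserves may not be the ordering you want:

$ printf '%s\n' 9.9.9.9 10.0.0.1 172.16.0.1 | sort
10.0.0.1
172.16.0.1
9.9.9.9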

I reset the configuration to use the default instead:

RandomPartitioner distributes rows across the cluster evenly by md5. When in doubt, this is the best option.
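
The change itself is a single line in each node's conf/cassandra.yaml, and every node in the cluster must agree:

# conf/cassandra.yaml -- the partitioner must be identical cluster-wide
#partitioner: org.apache.cassandra.dht.ByteOrderedPartitioner
partitioner: org.apache.cassandra.dht.RandomPartitioner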

After changing the configuration from ByteOrderedPartitioner to RandomPartitioner and restarting the first node... I am greeted with this happy message:

ERROR 13:03:36,113 Fatal exception in thread Thread[SSTableBatchOpen:3,5,main]
java.lang.RuntimeException: Cannot open /home/hpcass/data/node00/system/Versions-hc-3 because partitioner does not match org.apache.cassandra.dht.RandomPartitioner

In fact, I'm greeted with a lot of them, one for each SSTable it tries to open. These are followed by what looks like... possibly normal startup messaging?

 INFO 13:03:36,166 Creating new commitlog segment /home/hpcass/commitlog/node00/CommitLog-1331586216166.log
 INFO 13:03:36,175 Couldn't detect any schema definitions in local storage.
 INFO 13:03:36,175 Found table data in data directories. Consider using the CLI to define your schema.
 INFO 13:03:36,197 Replaying /home/hpcass/commitlog/node00/CommitLog-1331328557751.log
 INFO 13:03:36,222 Finished reading /home/hpcass/commitlog/node00/CommitLog-1331328557751.log
 INFO 13:03:36,227 Enqueuing flush of Memtable-LocationInfo@1762056890(213/266 serialized/live bytes, 7 ops)
 INFO 13:03:36,228 Writing Memtable-LocationInfo@1762056890(213/266 serialized/live bytes, 7 ops)
 INFO 13:03:36,228 Enqueuing flush of Memtable-Versions@202783062(83/103 serialized/live bytes, 3 ops)
 INFO 13:03:36,277 Completed flushing /home/hpcass/data/node00/system/LocationInfo-hc-16-Data.db (377 bytes)
 INFO 13:03:36,285 Writing Memtable-Versions@202783062(83/103 serialized/live bytes, 3 ops)
 INFO 13:03:36,357 Completed flushing /home/hpcass/data/node00/system/Versions-hc-4-Data.db (247 bytes)
 INFO 13:03:36,358 Log replay complete, 9 replayed mutations
 INFO 13:03:36,366 Cassandra version: 1.0.8
 INFO 13:03:36,366 Thrift API version: 19.20.0
 INFO 13:03:36,367 Loading persisted ring state
 INFO 13:03:36,384 Starting up server gossip
 INFO 13:03:36,386 Enqueuing flush of Memtable-LocationInfo@846275759(88/110 serialized/live bytes, 2 ops)
 INFO 13:03:36,386 Writing Memtable-LocationInfo@846275759(88/110 serialized/live bytes, 2 ops)
 INFO 13:03:36,440 Completed flushing /home/hpcass/data/node00/system/LocationInfo-hc-17-Data.db (196 bytes)
 INFO 13:03:36,446 Starting Messaging Service on port 7000
 INFO 13:03:36,452 Using saved token 0
 INFO 13:03:36,453 Enqueuing flush of Memtable-LocationInfo@59584763(38/47 serialized/live bytes, 2 ops)
 INFO 13:03:36,454 Writing Memtable-LocationInfo@59584763(38/47 serialized/live bytes, 2 ops)
 INFO 13:03:36,556 Completed flushing /home/hpcass/data/node00/system/LocationInfo-hc-18-Data.db (148 bytes)
 INFO 13:03:36,558 Node /10.1.0.23 state jump to normal
 INFO 13:03:36,558 Bootstrap/Replace/Move completed! Now serving reads.
 INFO 13:03:36,559 Will not load MX4J, mx4j-tools.jar is not in the classpath
 INFO 13:03:36,587 Binding thrift service to /10.1.0.23:9160
 INFO 13:03:36,590 Using TFastFramedTransport with a max frame size of 15728640 bytes.
 INFO 13:03:36,593 Using synchronous/threadpool thrift server on /10.1.0.23 : 9160
 INFO 13:03:36,593 Listening for thrift clients...

Despite the fatal errors, it does seem to have restarted the cluster with the new partitioner.
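
Checking the ring from any live node (give or take the exact node path, this is just nodetool's ring command):

node00/bin/nodetool -h 10.1.0.23 ring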

Address         DC          Rack        Status State   Load            Owns    Token                                       
                                                                               7169015515630842424558524306038950250903273734
10.1.0.27      datacenter1 rack1       Down   Normal  ?               93.84%  -2742379978670691477635174047251157095949195165
10.1.0.23      datacenter1 rack1       Up     Normal  15.79 KB        86.37%  0                                           
10.1.0.26      datacenter1 rack1       Down   Normal  ?               77.79%  896682280808232140910919391534960240163386913
10.1.0.24      datacenter1 rack1       Up     Normal  15.79 KB        53.08%  1927726543429020693034590137790785169819652674
10.1.0.25      datacenter1 rack1       Up     Normal  15.79 KB        35.85%  6138493926725652010223830601932265434881918085
10.1.0.28      datacenter1 rack1       Down   Normal  ?               53.08%  7169015515630842424558524306038950250903273734

Starting up the other three nodes produced similar output (this example is from node01):

 INFO 14:10:06,663 Node /10.1.0.25 has restarted, now UP
 INFO 14:10:06,663 InetAddress /10.1.0.25 is now UP
 INFO 14:10:06,664 Node /10.1.0.25 state jump to normal
 INFO 14:10:06,664 Node /10.1.0.24 has restarted, now UP
 INFO 14:10:06,665 InetAddress /10.1.0.24 is now UP
 INFO 14:10:06,665 Node /10.1.0.24 state jump to normal
 INFO 14:10:06,666 Node /10.1.0.23 has restarted, now UP
 INFO 14:10:06,667 InetAddress /10.1.0.23 is now UP
 INFO 14:10:06,668 Node /10.1.0.23 state jump to normal
 INFO 14:10:06,760 Completed flushing /home/hpcass/data/node01/system/LocationInfo-hc-18-Data.db (166 bytes)
 INFO 14:10:06,762 Node /10.1.0.26 state jump to normal
 INFO 14:10:06,763 Bootstrap/Replace/Move completed! Now serving reads.
 INFO 14:10:06,764 Will not load MX4J, mx4j-tools.jar is not in the classpath
 INFO 14:10:06,862 Binding thrift service to /10.1.0.26:9160

Re-checking the ring now shows:

Address         DC          Rack        Status State   Load            Owns    Token                                       
                                                                               7169015515630842424558524306038950250903273734
10.1.0.27      datacenter1 rack1       Up     Normal  11.37 KB        93.84%  -2742379978670691477635174047251157095949195165
10.1.0.23      datacenter1 rack1       Up     Normal  15.79 KB        86.37%  0                                           
10.1.0.26      datacenter1 rack1       Up     Normal  18.38 KB        77.79%  896682280808232140910919391534960240163386913
10.1.0.24      datacenter1 rack1       Up     Normal  15.79 KB        53.08%  1927726543429020693034590137790785169819652674
10.1.0.25      datacenter1 rack1       Up     Normal  15.79 KB        35.85%  6138493926725652010223830601932265434881918085
10.1.0.28      datacenter1 rack1       Up     Normal  15.79 KB        53.08%  7169015515630842424558524306038950250903273734

Switching partitioners appears to be easy enough. What I suspect, however (and I've not confirmed this), is that any existing data would have been compromised or outright destroyed in the process. The documentation I've read so far indicates that you cannot do this at all: once a cluster is set up with a specific partitioner, it is bound to it.

My conclusion: if you wish to change partitioners and have not yet started to saturate your cluster with data, the right time to do it is now, before you load anything.

I plan to test this theory after the first trial data load, to see if it does in fact mangle the information. More to follow!

UPDATE!

Despite what I thought nodetool was telling me, my cluster was unusable because of the partitioner change. What's the last step required to change partitioners? NUKE THE DATA. Un-fun... but that is what I needed to do.

Having six nodes means six times the fun. Here's the kicker, though: I'll just move the data aside and re-construct, which will let me swap it back in if I decide to go back and forth testing the impacts of Random vs. Ordered for my needs. Will I get away with this? I don't know. That won't stop me from trying!

The data was stored in ~/data/node00 (node## etc.). This is all I did:

mv data/node00 data/node00-bop       # bop = byte-ordered partitioner
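
The same treatment applies to all six nodes; a sketch, assuming the node00 through node05 layout and that each node is stopped before its data is moved:

for n in 00 01 02 03 04 05; do
    mv ~/data/node$n ~/data/node$n-bop   # set aside, don't delete -- we may want it back
done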

Restarted node00:

hpcass:~/nodes$ node00/bin/cassandra -f
 INFO 16:38:46,525 Logging initialized
 INFO 16:38:46,529 JVM vendor/version: OpenJDK 64-Bit Server VM/1.6.0_0
 INFO 16:38:46,529 Heap size: 6291456000/6291456000
 INFO 16:38:46,529 Classpath: node00/bin/../conf:node00/bin/../build/classes/main:node00/bin/../build/classes/thrift:node00/bin/../lib/antlr-3.2.jar:node00/bin/../lib/apache-cassandra-1.0.8.jar:node00/bin/../lib/apache-cassandra-clientutil-1.0.8.jar:node00/bin/../lib/apache-cassandra-thrift-1.0.8.jar:node00/bin/../lib/avro-1.4.0-fixes.jar:node00/bin/../lib/avro-1.4.0-sources-fixes.jar:node00/bin/../lib/commons-cli-1.1.jar:node00/bin/../lib/commons-codec-1.2.jar:node00/bin/../lib/commons-lang-2.4.jar:node00/bin/../lib/compress-lzf-0.8.4.jar:node00/bin/../lib/concurrentlinkedhashmap-lru-1.2.jar:node00/bin/../lib/guava-r08.jar:node00/bin/../lib/high-scale-lib-1.1.2.jar:node00/bin/../lib/jackson-core-asl-1.4.0.jar:node00/bin/../lib/jackson-mapper-asl-1.4.0.jar:node00/bin/../lib/jamm-0.2.5.jar:node00/bin/../lib/jline-0.9.94.jar:node00/bin/../lib/json-simple-1.1.jar:node00/bin/../lib/libthrift-0.6.jar:node00/bin/../lib/log4j-1.2.16.jar:node00/bin/../lib/servlet-api-2.5-20081211.jar:node00/bin/../lib/slf4j-api-1.6.1.jar:node00/bin/../lib/slf4j-log4j12-1.6.1.jar:node00/bin/../lib/snakeyaml-1.6.jar:node00/bin/../lib/snappy-java-1.0.4.1.jar
 INFO 16:38:46,531 JNA not found. Native methods will be disabled.
 INFO 16:38:46,538 Loading settings from file:/home/hpcass/nodes/node00/conf/cassandra.yaml
 INFO 16:38:46,635 DiskAccessMode 'auto' determined to be mmap, indexAccessMode is mmap
 INFO 16:38:46,645 Global memtable threshold is enabled at 2000MB
 INFO 16:38:46,839 Creating new commitlog segment /home/hpcass/commitlog/node00/CommitLog-1331599126839.log
 INFO 16:38:46,848 Couldn't detect any schema definitions in local storage.
 INFO 16:38:46,849 Found table data in data directories. Consider using the CLI to define your schema.
 INFO 16:38:46,863 Replaying /home/hpcass/commitlog/node00/CommitLog-1331597615041.log
 INFO 16:38:46,887 Finished reading /home/hpcass/commitlog/node00/CommitLog-1331597615041.log
 INFO 16:38:46,892 Enqueuing flush of Memtable-LocationInfo@1834491520(98/122 serialized/live bytes, 4 ops)
 INFO 16:38:46,893 Enqueuing flush of Memtable-Versions@875509103(83/103 serialized/live bytes, 3 ops)
 INFO 16:38:46,894 Writing Memtable-LocationInfo@1834491520(98/122 serialized/live bytes, 4 ops)
 INFO 16:38:47,001 Completed flushing /home/hpcass/data/node00/system/LocationInfo-hc-1-Data.db (208 bytes)
 INFO 16:38:47,009 Writing Memtable-Versions@875509103(83/103 serialized/live bytes, 3 ops)
 INFO 16:38:47,057 Completed flushing /home/hpcass/data/node00/system/Versions-hc-1-Data.db (247 bytes)
 INFO 16:38:47,057 Log replay complete, 6 replayed mutations
 INFO 16:38:47,066 Cassandra version: 1.0.8
 INFO 16:38:47,066 Thrift API version: 19.20.0
 INFO 16:38:47,067 Loading persisted ring state
 INFO 16:38:47,070 Starting up server gossip
 INFO 16:38:47,091 Enqueuing flush of Memtable-LocationInfo@952443392(88/110 serialized/live bytes, 2 ops)
 INFO 16:38:47,092 Writing Memtable-LocationInfo@952443392(88/110 serialized/live bytes, 2 ops)
 INFO 16:38:47,141 Completed flushing /home/hpcass/data/node00/system/LocationInfo-hc-2-Data.db (196 bytes)
 INFO 16:38:47,149 Starting Messaging Service on port 7000
 INFO 16:38:47,155 Using saved token 0
 INFO 16:38:47,157 Enqueuing flush of Memtable-LocationInfo@1623810826(38/47 serialized/live bytes, 2 ops)
 INFO 16:38:47,157 Writing Memtable-LocationInfo@1623810826(38/47 serialized/live bytes, 2 ops)
 INFO 16:38:47,237 Completed flushing /home/hpcass/data/node00/system/LocationInfo-hc-3-Data.db (148 bytes)
 INFO 16:38:47,239 Node /10.1.0.23 state jump to normal
 INFO 16:38:47,240 Bootstrap/Replace/Move completed! Now serving reads.
 INFO 16:38:47,241 Will not load MX4J, mx4j-tools.jar is not in the classpath
 INFO 16:38:47,269 Binding thrift service to /10.1.0.23:9160
 INFO 16:38:47,272 Using TFastFramedTransport with a max frame size of 15728640 bytes.
 INFO 16:38:47,274 Using synchronous/threadpool thrift server on /10.1.0.23 : 9160
 INFO 16:38:47,275 Listening for thrift clients...

^Z
[1]+  Stopped                 node00/bin/cassandra -f
hpcass:~/nodes$ bg
[1]+ node00/bin/cassandra -f &

With the process backgrounded (I started it in the foreground with -f to watch the log; omitting -f would have daemonized it from the start), I checked the files in the new data directory for my node:

hpcass:~/data/node00$ ls -1 system
LocationInfo-hc-1-Data.db
LocationInfo-hc-1-Digest.sha1
LocationInfo-hc-1-Filter.db
LocationInfo-hc-1-Index.db
LocationInfo-hc-1-Statistics.db
LocationInfo-hc-2-Data.db
LocationInfo-hc-2-Digest.sha1
LocationInfo-hc-2-Filter.db
LocationInfo-hc-2-Index.db
LocationInfo-hc-2-Statistics.db
LocationInfo-hc-3-Data.db
LocationInfo-hc-3-Digest.sha1
LocationInfo-hc-3-Filter.db
LocationInfo-hc-3-Index.db
LocationInfo-hc-3-Statistics.db
Versions-hc-1-Data.db
Versions-hc-1-Digest.sha1
Versions-hc-1-Filter.db
Versions-hc-1-Index.db
Versions-hc-1-Statistics.db

Following that clearing and rebuild, the nodetool results look a lot better:

hpcass@feed0:~/nodes$ cass00/bin/nodetool -h localhost ring
Address         DC          Rack        Status State   Load            Owns    Token                                       
                                                                               6138493926725652010223830601932265434881918085
10.1.0.23      datacenter1 rack1       Up     Normal  15.68 KB        33.29%  0                                           
10.1.0.24      datacenter1 rack1       Up     Normal  18.34 KB        30.87%  1927726543429020693034590137790785169819652674
10.1.0.25      datacenter1 rack1       Up     Normal  18.34 KB        35.85%  6138493926725652010223830601932265434881918085

After resetting the rest of the numbered nodes the same way, I had a complete disaster: negative node tokens! How did that happen? Restarts did nothing to fix it, either (presumably the nodes were picking up tokens remembered from the earlier, broken ring).

Address         DC          Rack        Status State   Load            Owns    Token                                       
                                                                               7169015515630842424558524306038950250903273734
10.1.0.27      datacenter1 rack1       Up     Normal  15.79 KB        93.84%  -2742379978670691477635174047251157095949195165
10.1.0.23      datacenter1 rack1       Up     Normal  15.79 KB        86.37%  0                                           
10.1.0.26      datacenter1 rack1       Up     Normal  15.79 KB        77.79%  896682280808232140910919391534960240163386913
10.1.0.24      datacenter1 rack1       Up     Normal  15.79 KB        53.08%  1927726543429020693034590137790785169819652674
10.1.0.25      datacenter1 rack1       Up     Normal  15.79 KB        35.85%  6138493926725652010223830601932265434881918085
10.1.0.28      datacenter1 rack1       Up     Normal  15.79 KB        53.08%  7169015515630842424558524306038950250903273734

To resolve this, I simply re-ran my token generator to get a new set of tokens:

node00	10.1.0.23  token: 0
node01	10.1.0.26  token: 28356863910078205288614550619314017621
node02	10.1.0.24  token: 56713727820156410577229101238628035242
node03	10.1.0.27  token: 85070591730234615865843651857942052863
node04	10.1.0.25  token: 113427455640312821154458202477256070485
node05	10.1.0.28  token: 141784319550391026443072753096570088106
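
The generator is nothing exotic: it's the standard even-spacing math over RandomPartitioner's 2^127 token space, initial_token(i) = i * 2^127 / N. A quick stand-in using bc (the last digit can differ by one from the list above, depending on rounding):

for i in 0 1 2 3 4 5; do echo "$i * 2^127 / 6" | bc; done

On a fresh cluster these would go into each node's cassandra.yaml as initial_token; since this ring was already running, nodetool move applies them live.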

So I manually moved the other five nodes to their new tokens (node00 on 10.1.0.23 already held token 0):

bin/nodetool -h 10.1.0.24 move 56713727820156410577229101238628035242
bin/nodetool -h 10.1.0.25 move 113427455640312821154458202477256070485

bin/nodetool -h 10.1.0.26 move 28356863910078205288614550619314017621
bin/nodetool -h 10.1.0.27 move 85070591730234615865843651857942052863
bin/nodetool -h 10.1.0.28 move 141784319550391026443072753096570088106

This... gave me the results I was expecting!

Address         DC          Rack        Status State   Load            Owns    Token                                       
                                                                               141784319550391026443072753096570088106     
10.1.0.23      datacenter1 rack1       Up     Normal  24.95 KB        16.67%  0                                           
10.1.0.26      datacenter1 rack1       Up     Normal  20.72 KB        16.67%  28356863910078205288614550619314017621      
10.1.0.24      datacenter1 rack1       Up     Normal  25.1 KB         16.67%  56713727820156410577229101238628035242      
10.1.0.27      datacenter1 rack1       Up     Normal  13.38 KB        16.67%  85070591730234615865843651857942052863      
10.1.0.25      datacenter1 rack1       Up     Normal  25.1 KB         16.67%  113427455640312821154458202477256070485     
10.1.0.28      datacenter1 rack1       Up     Normal  25.14 KB        16.67%  141784319550391026443072753096570088106   

Now the question of actually connecting to the cluster can be answered. Pick one of the nodes and ports to connect to. I picked node00 on .23 (the CLI defaults to port 9160, so I didn't have to specify it):

node00/bin/cassandra-cli -h 10.1.0.23 
Connected to: "test-ip" on 10.1.0.23/9160
Welcome to Cassandra CLI version 1.0.8

Type 'help;' or '?' for help.
Type 'quit;' or 'exit;' to quit.

The big problem I had was that the CLI never seemed to respond. The trick is to end your command with a semicolon. That might seem obvious to you, and generally obvious to me... but I'd not seen the docs actually call out that little FACT.

[default@unknown] show cluster name;
test-ip

Next I created a test keyspace and column family from the helpful Cassandra wiki. (The first create below failed because the keyspace evidently already existed; Cassandra's duplicate-name error is just worded oddly.)

create keyspace Twissandra;
Keyspace names must be case-insensitively unique ("Twissandra" conflicts with "Twissandra")
[default@unknown] 
[default@unknown] 
[default@unknown] create column family User with comparator = UTF8Type;
Not authenticated to a working keyspace.
[default@unknown] use Twissandra;
Authenticated to keyspace: Twissandra
[default@Twissandra] create column family User with comparator = UTF8Type;
adf453a0-6cb0-11e1-0000-13393ec611bd
Waiting for schema agreement...
... schemas agree across the cluster
[default@Twissandra] 

AND WE'RE OFF!! The next article will cover finishing up this last test and then adding real data. MORE TO COME!!

NEXT: Cassandra – A Use case examined (IP data)
