Rubber meeting the road. Time to insert some column families, then some data and finally pull it back off the stack.
First off, the keyspace was already defined, so I’m simply going to list its structure:
    [default@unknown] describe ip_store;
    Keyspace: ip_store:
      Replication Strategy: org.apache.cassandra.locator.SimpleStrategy
      Durable Writes: true
      Options: [replication_factor:2]
With a keyspace ready for some column families, those are created next. Here I’m establishing that there will be 4 families in this single keyspace. This is contrary to suggestions in the High Performance Cassandra Handbook, but follows all other documentation I’ve seen. Considering that this is NOT a production implementation, I’m going to go with a more conventional strategy of organizing related data in the same keyspace.
The first action is to switch to the desired keyspace, then add the column families:
    [default@unknown] use ip_store;
    Authenticated to keyspace: ip_store
    [default@ip_store] create column family warehouse with comparator = UTF8Type;
    595945d0-71ce-11e1-0000-13393ec611bf
    Waiting for schema agreement...
    ... schemas agree across the cluster
    [default@ip_store] create column family hourly with comparator = UTF8Type;
    65ea2170-71ce-11e1-0000-13393ec611bf
    Waiting for schema agreement...
    ... schemas agree across the cluster
    [default@ip_store] create column family daily with comparator = UTF8Type;
    6aaeae60-71ce-11e1-0000-13393ec611bf
    Waiting for schema agreement...
    ... schemas agree across the cluster
    [default@ip_store] create column family 30day with comparator = UTF8Type;
    7b85bf30-71ce-11e1-0000-13393ec611bf
    Waiting for schema agreement...
    ... schemas agree across the cluster
OK, a basic schema has been established. Now, to load the data. I’ll post the relevant sections of the loader code at a later date. At this point you only need to know that the loader DOES work and that it’s loading data. We’ll look at extracting the data after loading a very small set.
    time host=10.1.0.23 port=9160 ks=ip_store cf=warehouse ttl=0 datafile=5.ips ant -DclassToRun=loader.bulkIpLoader run
    Buildfile: cBuild/build.xml
    init:
    compile:
    [javac] Compiling 1 source file to cBuild/build/classes
    dist:
    [jar] Building jar: cBuild/dist/lib/cass.jar
    run:
    [java] ks ip_store
    [java] cf warehouse
    [java] ttl 0
    [java] datafile 5.ips
    BUILD SUCCESSFUL
    Total time: 1 second
The set holds five records covering three unique IPs; two of those IPs appear twice, with different timestamps (IMPORTANT NOTE: The IPs have been changed to protect the innocent and clueless):
    2016468288 1011 suspicious 2012-03-13 18:40:01
    2016468288 1011 suspicious 2012-03-13 18:40:02
    3149138705 1011 suspicious 2012-03-13 18:40:00
    3149138705 1011 suspicious 2012-03-13 18:40:01
    2179293112 1011 suspicious 2012-03-13 18:39:59
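To make the shape of this sample concrete, here is a small Python sketch (not the loader itself, which is Java and not shown here). It assumes the first field is the IPv4 address stored as a 32-bit integer and that each record becomes one column in the row for that key, named by its timestamp; both are inferences from the CLI output later in the post. The function names `parse_line`, `group_by_key` and `key_to_dotted` are made up for illustration.

```python
import ipaddress
from collections import defaultdict

# The five sample records from 5.ips, copied from the listing above.
SAMPLE = """\
2016468288 1011 suspicious 2012-03-13 18:40:01
2016468288 1011 suspicious 2012-03-13 18:40:02
3149138705 1011 suspicious 2012-03-13 18:40:00
3149138705 1011 suspicious 2012-03-13 18:40:01
2179293112 1011 suspicious 2012-03-13 18:39:59
"""


def parse_line(line):
    """Split one record into (key, prop_id, attribute, detected).

    The field names are guesses borrowed from the JSON keys in the stored values.
    """
    key, prop_id, attribute, detected = line.split(maxsplit=3)
    return key, prop_id, attribute, detected


def group_by_key(text):
    """Group timestamps under their row key, mimicking one row per IP."""
    rows = defaultdict(list)
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        key, _prop_id, _attribute, detected = parse_line(line)
        rows[key].append(detected)
    return dict(rows)


def key_to_dotted(key):
    """Assumption: the key is an IPv4 address written as a 32-bit integer."""
    return str(ipaddress.ip_address(int(key)))


rows = group_by_key(SAMPLE)
for key, columns in rows.items():
    print(key, key_to_dotted(key), columns)
```

Under those assumptions the five records collapse into three rows, and the key 3149138705 maps to the dotted address that shows up in the readable output further down.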
Having loaded these, I relaunch the command line interface, authenticate to the desired keyspace, and then issue a VERY important command that sets an assumption about how we’re going to reference the keys. If you get a strange error like “cannot parse ‘187.180.11.17’ as hex bytes”, you most likely forgot to issue the assume command. The commands I issued are the ones that follow each prompt.
    cass
    Connected to: "ak-ip" on 10.1.0.23/9160
    Welcome to Cassandra CLI version 1.0.8
    [default@unknown] use ip_store
    [default@ip_store] assume warehouse keys as utf8;
    Assumption for column family 'warehouse' added successfully.
    [default@ip_store] get warehouse['3149138705'];
    => (column=2012-03-13 18:40:00, value=7b227265706f72746564223a22323031322d30332d31332031383a34383a3031222c22617474726962757465223a22737573706963696f7573222c2270726f705f6964223a2231303131222c2270726f7065727479223a22426f74204375747761696c222c226465746563746564223a22323031322d30332d31332031383a34303a3030222c226d65746164617461223a22222c226970223a223139302e3137342e3235312e313435227d, timestamp=1332168862658)
    => (column=2012-03-13 18:40:01, value=7b227265706f72746564223a22323031322d30332d31332031383a34383a3031222c22617474726962757465223a22737573706963696f7573222c2270726f705f6964223a2231303131222c2270726f7065727479223a22426f74204375747761696c222c226465746563746564223a22323031322d30332d31332031383a34303a3031222c226d65746164617461223a22222c226970223a223139302e3137342e3235312e313435227d, timestamp=1331689681)
    Returned 2 results.
    Elapsed time: 39 msec(s).
There we go. A single row, ip_store[‘warehouse’][‘3149138705’], containing two column records, each with a JSON blob within it. Now, the next step: set an assumption on the validator (ascii here) so the values come back as text, and get output mere mortals such as yourselves can understand.
    [default@ip_store] assume warehouse validator as ascii;
    Assumption for column family 'warehouse' added successfully.
    [default@ip_store] get warehouse['3149138705'];
    => (column=2012-03-13 18:40:00, value={ "reported":"2012-03-13 18:48:01", "attribute":"suspicious", "prop_id":"1011", "detected":"2012-03-13 18:40:00", "ip":"187.180.11.17" }, timestamp=1331689680)
    => (column=2012-03-13 18:40:01, value={ "reported":"2012-03-13 18:48:01", "attribute":"suspicious", "prop_id":"1011", "detected":"2012-03-13 18:40:01", "ip":"187.180.11.17" }, timestamp=1331689681)
    Returned 2 results.
    Elapsed time: 2 msec(s).
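If you only have the hex form in hand (say, pasted from a session where the validator assumption was never set), it can be decoded outside the CLI as well. The Python sketch below takes the value of the first column from the earlier `get` and treats it as hex-encoded UTF-8 JSON, which is an assumption based on how that output reads. The helper name `decode_column_value` is hypothetical.

```python
import json

# Value of column "2012-03-13 18:40:00" from the first `get`, split only for line length.
HEX_VALUE = (
    "7b227265706f72746564223a22323031322d30332d31332031383a34383a3031222c22617474726962757465223a22737573706963696f7573"
    "222c2270726f705f6964223a2231303131222c2270726f7065727479223a22426f74204375747761696c"
    "222c226465746563746564223a22323031322d30332d31332031383a34303a3030"
    "222c226d65746164617461223a22222c226970223a223139302e3137342e3235312e313435227d"
)


def decode_column_value(hex_value):
    """Turn a hex string as printed by the CLI into a Python dict.

    Assumes the stored bytes are UTF-8 encoded JSON.
    """
    raw = bytes.fromhex(hex_value)          # hex digits -> raw bytes
    return json.loads(raw.decode("utf-8"))  # bytes -> text -> dict


record = decode_column_value(HEX_VALUE)
print(record["detected"], record["attribute"], record["prop_id"])
```

This is only a convenience for reading pasted output; inside the CLI, the validator assumption does the same job.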
There it is! Data written, data read. Now it’s up to you to think about how you might use this simple, flexible and powerful storage engine to solve your business needs.