How William Hill migrated NoSQL workloads at scale to Amazon Keyspaces


Social gaming and online sports betting are competitive environments. The game must be able to handle large volumes of unpredictable traffic while simultaneously promising zero downtime. In this domain, user retention is not just desirable, it's critical. William Hill is a global online gambling company based in London, England, and a founding member of the UK Betting and Gaming Council. They share the mission to champion the betting and gaming industry and set world-class standards to make sure customers have an enjoyable, fair, and safe betting and gambling experience. In sports betting, William Hill is an industry-leading brand, awarded prestigious industry titles such as the IGA Awards Sports Betting Operator of the Year in 2019, 2020, and 2022, and the SBC Awards Racing Sportsbook of the Year in 2019. William Hill was acquired in April 2021 by Caesars Entertainment, Inc (NASDAQ: CZR), the largest casino-entertainment company in the US and one of the world's most diversified casino-entertainment providers. At the heart of the William Hill gaming platform is a NoSQL database that maintains 100% uptime, scales in real time to handle millions of users, and provides users with a responsive and personalized experience across all of their devices.

In this post, we discuss how William Hill moved their workload from Apache Cassandra to Amazon Keyspaces (for Apache Cassandra) with zero downtime using AWS Glue ETL.

William Hill was facing challenges with scalability, cluster instability, high operational costs, and manual patching and server maintenance. They were looking for a NoSQL solution that was scalable, highly available, and fully managed, which would let them focus on providing a better user experience rather than maintaining infrastructure. William Hill Limited decided to move forward with Amazon Keyspaces, since it can run Apache Cassandra workloads on AWS using the same Cassandra application code and developer tools used today, without the need to provision, patch, or manage servers, or to install, maintain, or operate software.

Solution overview

William Hill Limited wanted to migrate their existing Apache Cassandra workloads to Amazon Keyspaces with a replication lag of minutes, and with minimal migration costs and development effort. Therefore, AWS Glue ETL was leveraged to deliver the desired outcome.

AWS Glue is a serverless data integration service that provides several benefits for migration:

  • No infrastructure to maintain; allocates the required computing power and runs multiple migration jobs concurrently.
  • All-in-one pricing model that includes infrastructure and is 55% cheaper than other cloud data integration options.
  • No lock-in with the service; data migration pipelines can be developed in open-source Apache Spark (Spark SQL, PySpark, and Scala).
  • The migration pipeline can be scaled fearlessly with Amazon Keyspaces and AWS Glue.
  • Built-in pipeline monitoring to ensure in-migration continuity.
  • AWS Glue ETL jobs make it possible to perform bulk data extraction from Apache Cassandra and ingest it into Amazon Keyspaces.

In this post, we take you through William Hill's journey of building the migration pipeline from scratch to migrate the Apache Cassandra workload to Amazon Keyspaces by leveraging AWS Glue ETL with the DataStax Spark Cassandra connector.

For the purpose of this post, let's look at a typical Cassandra network setup on AWS and the mechanism used to establish the connection with AWS Glue ETL. The migration solution described also works for Apache Cassandra hosted on on-premises clusters.

Architecture overview

The architecture demonstrates the migration environment, which requires Amazon Keyspaces, AWS Glue, Amazon Simple Storage Service (Amazon S3), and the Apache Cassandra cluster. To avoid high CPU utilization/saturation on the Apache Cassandra cluster during the migration process, you may want to deploy another Cassandra datacenter to isolate your production environment from the migration workload, making the migration process seamless for your customers.

Amazon S3 is used for staging while migrating data from Apache Cassandra to Amazon Keyspaces, so that the IO load on the Cassandra cluster serving live production traffic is minimized in case the data upload to Amazon Keyspaces fails and a retry must be performed.

Prerequisites

The Apache Cassandra cluster is hosted on Amazon Elastic Compute Cloud (Amazon EC2) instances, spread across three Availability Zones, and hosted in private subnets. AWS Glue ETL is hosted on Amazon Virtual Private Cloud (Amazon VPC) and thus needs an AWS Glue Studio custom connector and connection to be set up to communicate with the Apache Cassandra nodes hosted in the private subnets in the customer VPC. This enables the connection to the Cassandra cluster hosted in the VPC. The DataStax Spark Cassandra connector must be downloaded and saved to an Amazon S3 bucket: s3://$MIGRATION_BUCKET/jars/spark-cassandra-connector-assembly_2.12-3.2.0.jar.

Let's create an AWS Glue Studio custom connector named cassandra_connection and its corresponding connection named conn-cassandra-custom for the AWS Region us-east-1.

For the connector created, create an AWS Glue Studio connection and populate it with network information: the VPC and a subnet that allow AWS Glue ETL to establish a connection with Apache Cassandra.

  • Name: conn-cassandra-custom
  • Network options

Let's begin by creating a keyspace and table in Amazon Keyspaces using the Amazon Keyspaces console or CQLSH: a target keyspace named target_keyspace and a target table named target_table.

CREATE KEYSPACE target_keyspace WITH replication = {'class': 'SingleRegionStrategy'};

CREATE TABLE target_keyspace.target_table (
    userid      uuid,
    level       text,
    gameid      int,
    description text,
    nickname    text,
    zip         text,
    email       text,
    updatetime  text,
    PRIMARY KEY (userid, level, gameid)
) WITH default_time_to_live = 0 AND CUSTOM_PROPERTIES = {
	'capacity_mode':{
		'throughput_mode':'PROVISIONED',
		'write_capacity_units':76388,
		'read_capacity_units':3612
	}
} AND CLUSTERING ORDER BY (level ASC, gameid ASC);

After the table has been created, switch the table to on-demand mode to pre-warm the table and avoid AWS Glue ETL job throttling failures. The following script updates the throughput mode.

ALTER TABLE target_keyspace.target_table 
WITH CUSTOM_PROPERTIES = {
	'capacity_mode':{
		'throughput_mode':'PAY_PER_REQUEST'
	}
} 

Let's go ahead and create two Amazon S3 buckets to support the migration process. The first bucket (s3://your-spark-cassandra-connector-bucket-name) should store the Spark Cassandra connector assembly jar file, and the Cassandra and Keyspaces configuration YAML files.

The second bucket (s3://your-migration-stage-bucket-name) will be used to store intermediate Parquet files that identify the delta between the Cassandra cluster and the Amazon Keyspaces table, to track changes between subsequent executions of the AWS Glue ETL jobs.

In the following KeyspacesConnector.conf, set your contact points to connect to Amazon Keyspaces, and replace the username and the password with the AWS credentials.

Using the RateLimitingRequestThrottler, we can ensure that requests don't exceed the configured Keyspaces capacity. The G.1X DPU creates one executor per worker. The RateLimitingRequestThrottler in this example is set to 1000 requests per second. With this configuration, and the G.1X DPU, you'll achieve 1000 requests per second per AWS Glue worker. Adjust max-requests-per-second accordingly to fit your workload. Increase the number of workers to scale throughput to a table.
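To put numbers on this (a sketch in plain Python; the worker count below is an assumption matching the job definitions later in this post), the aggregate request rate is simply the per-worker throttler limit multiplied by the number of Glue workers:

```python
# Hypothetical sizing: each G.1X worker runs one executor, and the driver
# throttler caps each executor at max-requests-per-second.
max_requests_per_second = 1000   # from KeyspacesConnector.conf below
glue_workers = 2                 # assumption, matches --number-of-workers

# Aggregate request rate the job can push to Amazon Keyspaces.
total_requests_per_second = max_requests_per_second * glue_workers
print(total_requests_per_second)  # -> 2000
```

Scaling to more workers raises the aggregate rate linearly, so size the target table's capacity (or pre-warm, as shown earlier) accordingly.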

datastax-java-driver {
  basic.request.consistency = "LOCAL_QUORUM"
  basic.contact-points = ["cassandra.us-east-1.amazonaws.com:9142"]
  advanced.reconnect-on-init = true
  basic.load-balancing-policy {
    local-datacenter = "us-east-1"
  }
  advanced.auth-provider = {
    class = PlainTextAuthProvider
    username = "user-at-sample"
    password = "[email protected]=PASSWORD="
  }
  advanced.throttler = {
    class = RateLimitingRequestThrottler
    max-requests-per-second = 1000
    max-queue-size = 50000
    drain-interval = 1 millisecond
  }
  advanced.ssl-engine-factory {
    class = DefaultSslEngineFactory
    hostname-validation = false
  }
  advanced.connection.pool.local.size = 1
}

Similarly, create a CassandraConnector.conf file, set the contact points to connect to the Cassandra cluster, and replace the username and the password respectively.

datastax-java-driver {
  basic.request.consistency = "LOCAL_QUORUM"
  basic.contact-points = ["127.0.0.1:9042"]
  advanced.reconnect-on-init = true
  basic.load-balancing-policy {
    local-datacenter = "datacenter1"
  }
  advanced.auth-provider = {
    class = PlainTextAuthProvider
    username = "user-at-sample"
    password = "[email protected]=PASSWORD="
  }
}

Build the AWS Glue ETL migration pipeline with Amazon Keyspaces

To build a reliable, consistent delta-upload Glue ETL pipeline, let's decouple the migration process into two AWS Glue ETL jobs.

  • CassandraToS3 Glue ETL: Reads data from the Apache Cassandra cluster and transfers the migration workload to Amazon S3 in the Apache Parquet format. To identify incremental changes in the Cassandra tables, the job stores separate Parquet files with primary keys and an updated timestamp.
  • S3toKeyspaces Glue ETL: Uploads the migration workload from Amazon S3 to Amazon Keyspaces. During the first run, the ETL uploads the complete data set from Amazon S3 to Amazon Keyspaces; on subsequent runs, it calculates the incremental changes by comparing the updated timestamps across two subsequent runs and computing the incremental difference. The job also takes care of inserting new records, updating existing records, and deleting records based on the incremental difference.
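The delta logic of the second job can be illustrated outside of Spark. The following sketch (plain Python, with hypothetical snapshot data) compares two snapshots keyed by primary key: rows present only in the new snapshot are inserts, rows whose updated timestamp advanced are updates, and rows missing from the new snapshot are deletes.

```python
# Minimal sketch of the incremental-difference logic between two
# consecutive snapshots; the keys and rows below are hypothetical.

def compute_delta(current, new):
    """current/new: dict mapping primary key -> (updatetime, row)."""
    inserts = {k: v for k, v in new.items() if k not in current}
    updates = {k: v for k, v in new.items()
               if k in current and v[0] > current[k][0]}
    deletes = {k: v for k, v in current.items() if k not in new}
    return inserts, updates, deletes

# Example: one insert, one update, one delete, one unchanged row.
current = {("u1", "lvl1", 1): ("2023-01-01", {"nickname": "a"}),
           ("u2", "lvl1", 2): ("2023-01-01", {"nickname": "b"}),
           ("u3", "lvl2", 3): ("2023-01-01", {"nickname": "c"})}
new     = {("u1", "lvl1", 1): ("2023-01-02", {"nickname": "a2"}),
           ("u2", "lvl1", 2): ("2023-01-01", {"nickname": "b"}),
           ("u4", "lvl3", 4): ("2023-01-02", {"nickname": "d"})}

inserts, updates, deletes = compute_delta(current, new)
print(sorted(inserts), sorted(updates), sorted(deletes))
```

In the actual pipeline the same comparison runs in Spark over the Parquet snapshots in Amazon S3, with the updated timestamp derived from Cassandra write times.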

In this example, we use Scala to write the AWS Glue ETL, but you can also use PySpark.

Let's go ahead and create an AWS Glue ETL job named CassandraToS3 with the following job parameters:

aws glue create-job \
    --name "CassandraToS3" \
    --role "GlueKeyspacesMigration" \
    --description "Offload data from Cassandra to S3" \
    --glue-version "3.0" \
    --number-of-workers 2 \
    --worker-type "G.1X" \
    --connections "conn-cassandra-custom" \
    --command "Name=glueetl,ScriptLocation=s3://$MIGRATION_BUCKET/scripts/CassandraToS3.scala" \
    --max-retries 0 \
    --default-arguments '{
        "--job-language":"scala",
        "--KEYSPACE_NAME":"source_keyspace",
        "--TABLE_NAME":"source_table",
        "--S3_URI_FULL_CHANGE":"s3://$MIGRATION_BUCKET/full-dataset/",
        "--S3_URI_CURRENT_CHANGE":"s3://$MIGRATION_BUCKET/incremental-dataset/current/",
        "--S3_URI_NEW_CHANGE":"s3://$MIGRATION_BUCKET/incremental-dataset/new/",
        "--extra-files":"s3://$MIGRATION_BUCKET/conf/CassandraConnector.conf",
        "--conf":"spark.cassandra.connection.config.profile.path=CassandraConnector.conf",
        "--class":"GlueApp"
    }'

The CassandraToS3 Glue ETL job reads data from the Apache Cassandra table source_keyspace.source_table and writes it to the S3 bucket in the Apache Parquet format. The job rotates the Parquet files to help identify delta changes in the data between consecutive job executions. To identify inserts, updates, and deletes, you need to know the primary key and the columns' write times (updated timestamp) in the Cassandra cluster up front. Our primary key consists of several columns (userid, level, and gameid) plus a write time column, updatetime. If you have multiple updated columns, then you must use more than one write time column with an aggregation function. For example, for email and updatetime, take the maximum value between the write times for email and updatetime.
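As a sketch of that aggregation (plain Python; the write-time values are hypothetical), a row's effective update timestamp is the maximum of the per-column write times, so that a change to any mutable column is caught by the delta comparison:

```python
# Hypothetical per-column Cassandra write times (microseconds since epoch),
# as returned by WRITETIME() for each mutable column of one row.
row_writetimes = {"email": 1_672_531_200_000_000,
                  "updatetime": 1_672_617_600_000_000}

# The row's effective "updated timestamp" is the latest write time across
# all mutable columns, so no column change is missed.
effective_updatetime = max(row_writetimes.values())
print(effective_updatetime)
```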

The AWS Glue Spark code of this job offloads data to Amazon S3 using the spark-cassandra-connector. The script takes five parameters: KEYSPACE_NAME, TABLE_NAME, S3_URI_FULL_CHANGE, S3_URI_CURRENT_CHANGE, and S3_URI_NEW_CHANGE.

To upload the data from Amazon S3 to Amazon Keyspaces, you need to create a S3toKeyspaces Glue ETL job that uses Glue Spark code to read the Parquet files from the Amazon S3 bucket created as an output of the CassandraToS3 Glue job, identify inserts, updates, and deletes, and execute the requests against the target table in Amazon Keyspaces. The code sample takes the same five parameters: KEYSPACE_NAME, TABLE_NAME, S3_URI_FULL_CHANGE, S3_URI_CURRENT_CHANGE, and S3_URI_NEW_CHANGE.

Let's go ahead and create our second AWS Glue ETL job, S3toKeyspaces, with the following job parameters:

aws glue create-job \
    --name "S3toKeyspaces" \
    --role "GlueKeyspacesMigration" \
    --description "Push data to Amazon Keyspaces" \
    --glue-version "3.0" \
    --number-of-workers 2 \
    --worker-type "G.1X" \
    --command "Name=glueetl,ScriptLocation=s3://amazon-keyspaces-backups/scripts/S3toKeyspaces.scala" \
    --default-arguments '{
        "--job-language":"scala",
        "--KEYSPACE_NAME":"target_keyspace",
        "--TABLE_NAME":"target_table",
        "--S3_URI_FULL_CHANGE":"s3://$MIGRATION_BUCKET/full-dataset/",
        "--S3_URI_CURRENT_CHANGE":"s3://$MIGRATION_BUCKET/incremental-dataset/current/",
        "--S3_URI_NEW_CHANGE":"s3://$MIGRATION_BUCKET/incremental-dataset/new/",
        "--extra-files":"s3://$MIGRATION_BUCKET/conf/KeyspacesConnector.conf",
        "--conf":"spark.cassandra.connection.config.profile.path=KeyspacesConnector.conf",
        "--class":"GlueApp"
    }'

Job scheduling

The final step is to configure AWS Glue triggers or Amazon EventBridge, depending on your scheduling needs, to trigger the S3toKeyspaces Glue ETL job when the CassandraToS3 job has succeeded. If you want to run CassandraToS3 on a schedule, then the following example showcases how to schedule CassandraToS3 to run every 15 minutes.
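One way to wire this up (a sketch; the trigger names and the cron expression are assumptions) is a pair of AWS Glue triggers: a scheduled trigger for CassandraToS3 and a conditional trigger that starts S3toKeyspaces on its SUCCEEDED state. The code below only constructs the request payloads; each dict could then be passed to boto3 as glue_client.create_trigger(**args).

```python
# Sketch of two Glue trigger definitions (names and cron are assumptions);
# each dict matches the shape of boto3's glue.create_trigger parameters.

scheduled_trigger = {
    "Name": "cassandra-to-s3-every-15-min",   # hypothetical name
    "Type": "SCHEDULED",
    "Schedule": "cron(0/15 * * * ? *)",       # every 15 minutes
    "StartOnCreation": True,
    "Actions": [{"JobName": "CassandraToS3"}],
}

conditional_trigger = {
    "Name": "s3-to-keyspaces-on-success",     # hypothetical name
    "Type": "CONDITIONAL",
    "StartOnCreation": True,
    # Fire only after CassandraToS3 finishes successfully.
    "Predicate": {"Conditions": [{
        "LogicalOperator": "EQUALS",
        "JobName": "CassandraToS3",
        "State": "SUCCEEDED",
    }]},
    "Actions": [{"JobName": "S3toKeyspaces"}],
}

print(scheduled_trigger["Schedule"])
```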

Job tuning

There are Spark settings recommended to begin with for Amazon Keyspaces, which can then be increased later as appropriate for your workload.

  • Use a Spark partition size (which groups multiple Cassandra rows) smaller than 8 MB to avoid replaying large Spark tasks during a task failure.
  • Use a low number of concurrent writes per DPU with a large number of retries. Add the following options to the job parameters: --conf spark.cassandra.query.retry.count=500 --conf spark.cassandra.output.concurrent.writes=3.
  • Set spark.task.maxFailures to a bounded value. For example, you can start from 32 and increase as needed. This option can help you increase the number of task retries during the table pre-warm stage. Add the following option to the job parameters: --conf spark.task.maxFailures=32
  • Another recommendation is to turn off batching to improve random access patterns. Add the following options to the job parameters:
    spark.cassandra.output.batch.size.rows=1
    spark.cassandra.output.batch.grouping.key=none
    spark.cassandra.output.batch.grouping.buffer.size=100
  • Randomize your workload. Amazon Keyspaces partitions data using partition keys. Although Amazon Keyspaces has built-in logic to help load balance requests for the same partition key, loading the data is faster and more efficient if you randomize the order, because you can take advantage of the built-in load balancing of writing to different partitions. To spread the writes across the partitions evenly, you must randomize the data in the dataframe. You can use a rand function to shuffle the rows in the dataframe.

Summary

William Hill was able to migrate their workload from Apache Cassandra to Amazon Keyspaces at scale using AWS Glue, without the need to make any changes to their application tech stack. The adoption of Amazon Keyspaces has given them the headroom to focus on their application and customer experience, because with Amazon Keyspaces there's no need to manage servers, and they get a highly scalable, secure solution with performance at scale and the ability to handle sudden spikes in demand.

In this post, you saw how to use AWS Glue to migrate the Cassandra workload to Amazon Keyspaces while keeping your Cassandra source databases completely functional during the migration process. When your applications are ready, you can choose to cut them over to Amazon Keyspaces with a minimal replication lag of sub minutes between the Cassandra cluster and Amazon Keyspaces. You can also use a similar pipeline to replicate the data back to the Cassandra cluster from Amazon Keyspaces to maintain data consistency, if needed. Here you can find the documents and code to help accelerate your migration to Amazon Keyspaces.


About the Authors

Nikolai Kolesnikov is a Senior Data Architect who helps AWS Professional Services customers build highly scalable applications using Amazon Keyspaces. He also leads Amazon Keyspaces ProServe customer engagements.

Kunal Gautam is a Senior Big Data Architect at Amazon Web Services. With experience building his own startup and working alongside enterprises, he brings a unique perspective on getting people, business, and technology to work in tandem for customers. He is passionate about helping customers in their digital transformation journey and enabling them to build scalable data and advanced analytics solutions to gain timely insights and make critical business decisions. In his spare time, Kunal enjoys marathons, tech meetups, and meditation retreats.
