MemSQL

Infrastructure

2 VMs provisioned on Google Cloud Platform.

  • VM#1
    • Machine type - 8 vCPUs, 30 GB memory
    • CPU platform - Intel Broadwell
    • 50GB SSD persistent disk
    • source image - debian-9-stretch-v20180307
    • MemSQL - Master aggregator and a leaf node
  • VM#2
    • Machine type - 16 vCPUs, 30 GB memory
    • CPU platform - Intel Broadwell
    • 90GB SSD persistent disk
    • source image - debian-9-stretch-v20180307
    • MemSQL - Two leaf nodes

Install MemSQL

Download the software:

cd /tmp
wget http://download.memsql.com/memsql-ops-6.0.10/memsql-ops-6.0.10.tar.gz
tar zxvf memsql-ops-6.0.10.tar.gz

Run the installer:

cd memsql-ops-6.0.10
sudo ./install.sh

Make the second VM follow the first one. At this point MemSQL Ops Web UI should be able to see both VMs as part of cluster:

memsql-ops follow -h <master-vm-ip>

From the MemSQL Ops Web UI, add MemSQL nodes with one master and three leaf node roles across the two VMs.

For starting/stopping the cluster use the below commands from the master VM.:

memsql-ops memsql-stop --all
memsql-ops memsql-start --all

Schema Design

Create database:

create database mymemsqldb;
show databases;
use mymemsqldb;

Create yellow_trips table:

create table mymemsqldb.yellow_trips (
   vendor_id           VARCHAR(5),
   pickup_datetime DATETIME,
   dropoff_datetime DATETIME,
   pickup_longitude DECIMAL(15,8),
   pickup_latitude DECIMAL(15,8),
   dropoff_longitude DECIMAL(15,8),
   dropoff_latitude DECIMAL(15,8),
   rate_code  VARCHAR(5),
   passenger_count    TINYINT,
   trip_distance FLOAT,
   payment_type VARCHAR(5),
   fare_amount FLOAT,
   extra FLOAT,
   mta_tax    FLOAT,
   imp_surcharge FLOAT,
   tip_amount FLOAT,
   tolls_amount FLOAT,
   total_amount FLOAT,
   store_and_fwd_flag VARCHAR(5),
   key (pickup_datetime,dropoff_datetime,vendor_id) using clustered columnstore,
   shard key (pickup_datetime, dropoff_datetime));

Copy yellow taxi dataset from google bucket to VM:

gsutil cp gs://yellow_taxi_trips/data/trips0000000000*.gz .

The above downloads all the ~1000 .gz compressed csv files on to local VM disk.

MemSQL can directly load .gz files in to tables:

load data infile '/home/suds/trips00000000*.gz' into table mymemsqldb.yellow_trips columns terminated by ',' optionally enclosed by '"' lines terminated by '\n'  ignore 1 lines (vendor_id,@pickup_datetime,@dropoff_datetime,pickup_longitude,pickup_latitude,dropoff_longitude,dropoff_latitude,rate_code,passenger_count,trip_distance,payment_type,fare_amount,extra,mta_tax,imp_surcharge,tip_amount,tolls_amount,total_amount,store_and_fwd_flag) set pickup_datetime = STR_TO_DATE(SUBSTR(@pickup_datetime,1,19), '%Y-%m-%d %H:%i:%s'),dropoff_datetime = STR_TO_DATE(SUBSTR(@dropoff_datetime,1,19), '%Y-%m-%d %H:%i:%s');

Average Ingestion rate - ~200,000/s. Total Rows - 1,108,779,463 Total Size - 34 GB (62% compression)

Performance tests

Query#1

How many trips did Yellow taxis took each month?:

SELECT DATE_TRUNC('month',pickup_datetime) month, COUNT(*) trips
  FROM  mymemsqldb.yellow_trips
 GROUP BY  1
 ORDER BY  1;

#1 - 78 rows in set (6.66 sec)
#2 - 78 rows in set (5.90 sec)
#3 - 78 rows in set (5.08 sec)

Query#2

What was the average speed of Yellow taxi trips in 2015 by hour?:

SELECT EXTRACT(HOUR FROM pickup_datetime) hour,
       ROUND(AVG(trip_distance / TIMESTAMPDIFF (SECOND,pickup_datetime,dropoff_datetime))*3600, 1) speed
  FROM mymemsqldb.yellow_trips
 WHERE trip_distance > 0
   AND fare_amount/trip_distance BETWEEN 2 AND 10
   AND dropoff_datetime > pickup_datetime
   AND pickup_datetime >= '2015-01-01' AND pickup_datetime < '2016-01-01'
 GROUP BY  1
 ORDER BY  1;

#1 - 24 rows in set (4.12 sec)
#2 - 24 rows in set (3.22 sec)
#3 - 24 rows in set (1.51 sec)

Query#3

What was the average speed of Yellow taxi trips in 2014 by day of week?

SELECT DAYOFWEEK(pickup_datetime) DAY_OF_WEEK,
       ROUND(AVG(trip_distance / TIMESTAMPDIFF(SECOND,pickup_datetime,dropoff_datetime))*3600, 1) speed
  FROM mymemsqldb.yellow_trips
 WHERE trip_distance > 0
   AND fare_amount/trip_distance BETWEEN 2 AND 10
   AND dropoff_datetime > pickup_datetime
   AND pickup_datetime >= '2014-01-01' AND pickup_datetime < '2015-01-01'
 GROUP BY  1
 ORDER BY  1;

#1 - 7 rows in set (6.61 sec)
#2 - 7 rows in set (4.81 sec)
#3 - 7 rows in set (2.88 sec)

Query#4

Show the first 10 rows offset 100 in 2013 for CMT vendor, trip_distance > 10 and #passengers > 1:

SELECT *
  FROM mymemsqldb.yellow_trips
 WHERE trip_distance > 10
   AND passenger_count > 1
   AND vendor_id = 'CMT'
   AND pickup_datetime >= '2013-01-01' AND pickup_datetime < '2014-01-01'
 ORDER BY pickup_datetime
 LIMIT 10 OFFSET 100;

#1 - 10 rows in set (4.89 sec)
#2 - 10 rows in set (0.66 sec)
#3 - 10 rows in set (0.31 sec)