DMTN-032: Qserv Data Placement

  • Frédéric Gaudet and
  • Fabrice Jammes

Latest Revision: 2017-02-07

1   Galactica setup

1.1   Current status

The Galactica platform https://galactica.isima.fr is an OpenStack computing system which relies on a Ceph cluster storage.

It is currently used for a variety of research projects, including Big Data. Qserv S15 Large Scale Test dataset https://confluence.lsstcorp.org/display/DM/S15+Large+Scale+Tests uses almost 1/3 of our storage capacity and 1/10 of our computing capacity.

_images/galactica-current-architecture.jpg

Currently, the Ceph storage is fully shared between all users and due to the size of Qserv dataset compared to our storage capacity, it has a huge impact on it. This setup has 2 main drawbacks: * Sysadmin can not change any parameter on Ceph without impacting all other projects (like create/delete pools for Qserv), * Qserv has a huge pressure on storage which could leads to slow down all other projects.

As a consequence, Galactica project manager would recommend the following design:

_images/galactica-architecture-proposal.jpg

Galactica project manager propose to share computes servers since the workload is mainly data transfert type and the number of VM is bounded. But the storage itself must be separated on dedicated servers. Only the CEPH OSD (Object Storage Daemon), which manage date hard drives, needs to be on dedicated servers. However, Qserv can use the global Ceph manager. Galactica project manager propose to create one to many dedicated pool on those separates servers, and check which setup is the best.

1.2   Benefits

Galactica project manager could change anything on Ceph parameters without disturbing other projects, If Qserv IOPS (I/O per seconds) are too high, only the OSD which host Qserv data would be impacted.

1.3   Requirements

1.3.1   Hardware

To host S15 dataset, i.e. 35TB of data, we would need 2 additional storage servers. These servers would fit well with our existing platform : * Dell R740xD : 25TB, 14K euros, including 7 years warranty, SLA : Day +1

1.3.2   Manpower

Galactica project manager: * manage the installation and integration of those servers in our existing platform * work with Qserv team to benchmark and improve Qserv for OpenStack/Ceph.

2   Future tracks

2.1   Requirements

Provide a sharp requirements list for Data-Placement is critical in order to offer an efficient and well-fitted solution.

Maximize high availability
SQL user queries must continue in real time, if less than 2% worker nodes crashes.
Minimize overall cost
Costs include infrastructure, sofware development, system administration and maintenance.
Minimize data reconstruction time
A node which crashes must be cloned in a minimal amount of time (data and application)
Prevent data loss
At least one instance of each chunk must be available at any time on disk, even in case 5% of storage is definitly lost. Archive should also be managed by the system in order to reduce cost.
Maximize performances
  • Bottleneck must be identified and quantified for each proposed architecture.
  • Chunk placement combinations can be optimized depending on a given query set.
Support Disaster recovery
At first glance, disaster recovery looks like a separate, likely unrelated concern. However, when considering operations for Qserv instead of just the standing up of Qserv, this all become related. A first requirement is that the data recovery loading method must take into account changes in hardware configs between initial load and recovery.

2.2   Data throughput

Based on http://ldm-135.readthedocs.io/en/master/#requirements

In order to provide a lower bound for data throughput, problem is simplified:

  • low volume queries are ignored.

  • high volume queries are splitted in following groups:

    1. full scan of Object table
    2. joins of the Source tables with the most frequent accessed columns in the Object table
    3. joins of the ForcedSource tables with the most frequent accessed columns in the Object table
    4. scans of the full Object table or joins of it with up to three additional tables other than Source and ForcedSource
Table 1 The estimated data throughput for concurrent high volumes queries
Query type Avg query number at time t Avg query latency (hour) Data read by 1 query (TB)
1 16 1 107
2 12 12 5107
3 12 12 2007
4 16 16 163
Table 2 The estimated data throughput for one node, for a 500 nodes cluster
Query type Data read by 1 query in 1 hour on 1 node (TB)
1 0.214
2 0.8511
3 0.3345
4 0.02025

With optimal shared scans (i.e. each table is only read once for concurrent queries), data throughput on one node is: 394 MB/sec

Without shared scans, data throughput on one node is: 4993 MB/sec