1 Requirements
Providing a sharp requirements list for data placement is critical in order to offer an efficient and well-fitted solution.
- Maximize availability
  - SQL user queries must continue to run in real time if fewer than 2% of worker nodes crash.
- Minimize overall cost
  - Costs include infrastructure, software development, system administration, and maintenance.
- Minimize data reconstruction time
  - A crashed node must be cloned (data and application) in a minimal amount of time.
- Prevent data loss
  - At least one instance of each chunk must remain available on disk at all times, even if 5% of the storage is permanently lost (an illustrative replication estimate follows this list). Archives should also be managed by the system in order to reduce cost.
- Maximize performance
  - Bottlenecks must be identified and quantified for each proposed architecture.
  - Chunk placement combinations can be optimized for a given query set.
- Support disaster recovery
  - At first glance, disaster recovery looks like a separate, likely unrelated concern. However, when considering the operation of Qserv rather than just standing it up, these concerns become related. A first requirement is that the data-recovery loading method must account for changes in hardware configuration between the initial load and recovery.
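The 5% storage-loss figure above can be turned into a rough replication estimate. The sketch below is illustrative only: it assumes k replicas of each chunk placed uniformly at random on distinct nodes, and a hypothetical 500-node cluster; neither the placement policy nor the cluster size is specified by the requirements.

```python
# Illustrative estimate: probability that a chunk loses all of its
# replicas when 5% of the nodes are permanently lost, assuming k
# replicas placed uniformly at random on distinct nodes. The 500-node
# cluster and the random-placement policy are assumptions made here
# for illustration, not requirements.
from math import comb

def chunk_loss_probability(nodes: int, failed: int, replicas: int) -> float:
    """P(every replica of one chunk sits on a failed node)."""
    return comb(failed, replicas) / comb(nodes, replicas)

for k in (1, 2, 3):
    p = chunk_loss_probability(nodes=500, failed=25, replicas=k)
    print(f"{k} replica(s): P(chunk lost) ~ {p:.1e}")
```

Under these assumptions, 2 replicas still leave a per-chunk loss probability of about 2.4e-3, so with hundreds of thousands of chunks some loss would be expected; 3 replicas bring it down to about 1.1e-4 per chunk.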
2 Data throughput
Based on http://ldm-135.readthedocs.io/en/master/#requirements
In order to provide a lower bound for data throughput, the problem is simplified:

- Low-volume queries are ignored.
- High-volume queries are split into the following groups:
  1. full scans of the Object table
  2. joins of the Source table with the most frequently accessed columns of the Object table
  3. joins of the ForcedSource table with the most frequently accessed columns of the Object table
  4. scans of the full Object table, or joins of it with up to three additional tables other than Source and ForcedSource
| Query type | Avg concurrent queries | Avg query latency (hours) | Data read by one query (TB) |
|---|---|---|---|
| 1 | 16 | 1 | 107 |
| 2 | 12 | 12 | 5107 |
| 3 | 12 | 12 | 2007 |
| 4 | 16 | 16 | 163 |
| Query type | Data read by one query in 1 hour on one node (TB) |
|---|---|
| 1 | 0.214 |
| 2 | 0.8511 |
| 3 | 0.3345 |
| 4 | 0.02025 |
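The per-node figures follow from the first table once a cluster size is fixed; the numbers are consistent with a 500-node cluster (e.g. 107 TB / 1 h / 500 = 0.214 TB), although that node count is inferred here rather than stated above. A minimal sketch of the derivation:

```python
# Derive per-node, per-hour data volumes from the query-mix table.
# The 500-node cluster size is an assumption inferred from the table
# (107 TB / 1 h / 500 nodes = 0.214 TB), not stated in the source.
NODES = 500

# (query type, avg concurrent queries, latency in hours, TB read per query)
QUERY_MIX = [
    (1, 16, 1, 107),
    (2, 12, 12, 5107),
    (3, 12, 12, 2007),
    (4, 16, 16, 163),
]

for qtype, _, latency_h, tb_per_query in QUERY_MIX:
    per_node_hour = tb_per_query / latency_h / NODES
    print(f"type {qtype}: {per_node_hour:.4f} TB per hour per node")
```

This reproduces the table values to within rounding.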
With optimal shared scans (i.e. each table is read only once for all concurrent queries), the data throughput on one node is 394 MB/sec.
Without shared scans, the data throughput on one node is 4993 MB/sec.
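Both figures can be checked from the per-node table: with shared scans each table is read once per hour no matter how many queries are concurrent, while without them every concurrent query reads independently. A sketch of the arithmetic, assuming 1 TB = 10^12 bytes:

```python
# Reproduce the per-node throughput bounds from the table above.
# With shared scans, each table is read once per hour for all
# concurrent queries; without them, every query reads independently.

# (avg concurrent queries, TB read by one query in 1 hour on one node)
PER_NODE = [
    (16, 0.214),
    (12, 0.8511),
    (12, 0.3345),
    (16, 0.02025),
]

TB = 1e12     # bytes per TB
MB = 1e6      # bytes per MB
HOUR = 3600   # seconds per hour

shared = sum(tb for _, tb in PER_NODE)        # each table read once
unshared = sum(n * tb for n, tb in PER_NODE)  # one read per query

print(f"shared scans:    {shared * TB / HOUR / MB:.0f} MB/sec")    # ~394
print(f"no shared scans: {unshared * TB / HOUR / MB:.0f} MB/sec")  # ~4993
```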