Considerations
Security
Use hardened APIs and best-practice security measures appropriate to the environment.
Reliability
The Data Store is required to respond consistently to Read and Write requests.
Speed!
The storage solution needs to sustain IO operations at the speed the application demands in order to maintain user experience and data concurrency.
Scalability
Select an appropriate storage solution (e.g. DAS / SAN / NAS) that can scale to meet the future demands of the application within the constraints of the environment.
Backup
A robust backup design with a minimal HW footprint and appropriate point-in-time recovery, system restore time, geo-redundancy and replication factor.
Fault Tolerance
Rugged fault tolerance built into both the hardware and software layer.
How can we achieve this?
– Storage Path Optimisation
Keep the data as close as possible to where it’s being used
– Accelerated Data Access
- Multidimensional Caching
- Hardware Level (e.g. Storage Controller Cache / SAN Cache Pool)
- Storage Fabric Level
- System Service Level
- Application Data Access Layer / API Level
- In-Memory Datasets and Indexes
- Adaptive Data Compression, De-duplication, Preallocation
– Data Classification for performance and cost efficiency
- Ranging from simple access-frequency or age-based rules to complex pattern-based or statistical predictive algorithms
- Or Data Type based classification for Object, Block and File Storage
- Example:
Hot Data ~ In Memory: extremely high frequency; semi-structured transactional data; preprocessing subsets
Warm Data ~ Flash Disk (SSD): high frequency; structured subsets
Cold Data ~ Fast Disk Array: low frequency; structured subsets
Icy Data ~ Slow Disk Array / Remote: very low frequency; structured; compressed subsets
Frozen Data ~ Tape Library: archive
It is necessary to identify differences in access patterns to the various pieces of data in order to ensure that the appropriate storage solution is chosen for each type.
Data which is accessed or updated frequently can be classed as hot.
Data which is accessed or updated only occasionally can be classed as cold, with warm data falling somewhere in between.
These different classifications can allow us to further differentiate the replication factor and access speed required for the different data areas.
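As a minimal sketch of frequency-based classification, the Python below assigns a dataset to a tier from its daily access counts and looks up a placement for that tier. The thresholds, the replication factors and all names (`AccessStats`, `classify`, `PLACEMENT`) are assumptions made for illustration; real cut-offs depend on the workload.

```python
from dataclasses import dataclass

# Hypothetical thresholds in accesses per day; real cut-offs depend on the workload.
HOT_MIN_ACCESSES = 1000
WARM_MIN_ACCESSES = 10

# Illustrative placement policy: tier -> (storage medium, replication factor).
# The media follow the example tiers above; the replication factors are assumed.
PLACEMENT = {
    "hot": ("in-memory", 3),
    "warm": ("ssd", 2),
    "cold": ("disk-array", 2),
}


@dataclass
class AccessStats:
    name: str
    reads_per_day: int
    updates_per_day: int


def classify(stats):
    """Return 'hot', 'warm' or 'cold' from the total daily accesses."""
    accesses = stats.reads_per_day + stats.updates_per_day
    if accesses >= HOT_MIN_ACCESSES:
        return "hot"
    if accesses >= WARM_MIN_ACCESSES:
        return "warm"
    return "cold"


def placement_for(stats):
    """Look up the storage medium and replication factor for a dataset."""
    return PLACEMENT[classify(stats)]


# Made-up datasets to exercise the classifier.
sessions = AccessStats("sessions", reads_per_day=50_000, updates_per_day=8_000)
invoices = AccessStats("invoices", reads_per_day=120, updates_per_day=15)
archive = AccessStats("archive_2015", reads_per_day=2, updates_per_day=0)

for dataset in (sessions, invoices, archive):
    print(dataset.name, classify(dataset), placement_for(dataset))
```

Keeping the thresholds and the placement table as plain data makes it easy to tune them once real access statistics are available.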
– Reduce Disk IO needed for each data request
In the case of BLOB (binary large object) storage, we can use large volumes (~100GB) with an in-memory index.
A 100GB volume can store a large number of images, for example, with the location of each image within the volume recorded in an index that is held in the storage node’s memory for quick access.
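The idea can be sketched in a few lines of Python: blobs are appended to a single volume file, and a dictionary in memory maps each key to its offset and length, so a read costs one seek and one read. The class name `BlobVolume` and its methods are hypothetical, and the sketch does not persist or rebuild the index, which a real storage node would need to do.

```python
import os
import tempfile


class BlobVolume:
    """Append-only volume file with an in-memory {key: (offset, length)} index."""

    def __init__(self, path):
        # "a+b" creates the file if needed and makes every write an append.
        self._file = open(path, "a+b")
        # Held in memory only; this sketch does not rebuild it on reopen.
        self._index = {}

    def put(self, key, data):
        self._file.seek(0, os.SEEK_END)
        offset = self._file.tell()
        self._file.write(data)
        self._file.flush()
        self._index[key] = (offset, len(data))

    def get(self, key):
        # The index lookup is a memory access; the disk is touched once.
        offset, length = self._index[key]
        self._file.seek(offset)
        return self._file.read(length)

    def close(self):
        self._file.close()


# Usage with two small stand-in "images".
with tempfile.TemporaryDirectory() as tmp:
    volume = BlobVolume(os.path.join(tmp, "volume.bin"))
    volume.put("cat.jpg", b"\xff\xd8cat-bytes")
    volume.put("dog.jpg", b"\xff\xd8dog-bytes-longer")
    first = volume.get("cat.jpg")
    second = volume.get("dog.jpg")
    volume.close()
```

Because the index stores only an offset and a length per object, it stays small relative to the volume it describes.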
– Using SSDs
At Avinton we design our solutions to place the HOT data on SSD arrays and the Cold data on spinning disks. Where it is not immediately apparent which data is hot or cold, we gather metadata on the files or tables to understand the number of reads, updates, index scans and so on, which then allows us to isolate the hot data.
In some cases data classification (HOT / Warm / Cold) is relative to age, so newer data is HOT while older data is expired onto the Cold storage area.
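One way to turn gathered metadata into a hot set is sketched below: sum the activity counters per table, then take the busiest tables until they cover a chosen share of all activity. The function `isolate_hot`, the 80% share and the sample counters are assumptions for illustration and do not describe a specific Avinton tool. In PostgreSQL, comparable counters could be read from the `pg_stat_user_tables` statistics view.

```python
from typing import Dict, List


def isolate_hot(counters: Dict[str, Dict[str, int]], share: float = 0.8) -> List[str]:
    """Return the busiest tables that together cover `share` of all activity."""
    activity = {name: sum(counts.values()) for name, counts in counters.items()}
    total = sum(activity.values())
    hot: List[str] = []
    covered = 0
    # Walk tables from most to least active until the share is covered.
    for name, count in sorted(activity.items(), key=lambda item: item[1], reverse=True):
        if total == 0 or covered >= share * total:
            break
        hot.append(name)
        covered += count
    return hot


# Made-up counters standing in for gathered metadata.
sample = {
    "events": {"reads": 9000, "updates": 500, "index_scans": 4000},
    "customers": {"reads": 600, "updates": 50, "index_scans": 300},
    "audit_log": {"reads": 20, "updates": 0, "index_scans": 5},
}

hot_tables = isolate_hot(sample)
print(hot_tables)
```

With these sample numbers a single table dominates the activity, so it alone would be the candidate for the SSD array.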
– In-Memory Index
In scenarios where the data volumes are very large, we also use in-memory indexes, typically in the form of key-value pairs. With the recent improvements in the reliability of in-memory key-value solutions with persistence, we are able to achieve significant performance gains with minimal risk.
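A toy version of an in-memory key-value index with persistence is sketched below: lookups are served from a dictionary, while every update is first appended to a log file that is replayed on start-up. The class `PersistentIndex` and the JSON-lines log format are assumptions for illustration; production key-value stores implement persistence with far more care (compaction, snapshots, crash recovery).

```python
import json
import os
import tempfile


class PersistentIndex:
    """In-memory key-value index backed by an append-only log file."""

    def __init__(self, log_path):
        self._data = {}
        # Replay the log, if any, to rebuild the in-memory state.
        if os.path.exists(log_path):
            with open(log_path, "r", encoding="utf-8") as log:
                for line in log:
                    key, value = json.loads(line)
                    self._data[key] = value
        self._log = open(log_path, "a", encoding="utf-8")

    def set(self, key, value):
        # Write to the log first so an acknowledged update survives a restart.
        self._log.write(json.dumps([key, value]) + "\n")
        self._log.flush()
        os.fsync(self._log.fileno())
        self._data[key] = value

    def get(self, key, default=None):
        # Reads never touch the disk.
        return self._data.get(key, default)

    def close(self):
        self._log.close()


# Usage: write an entry, "restart", and read it back.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "index.log")
    index = PersistentIndex(path)
    index.set("cat.jpg", {"volume": 3, "offset": 1024})
    index.close()

    reopened = PersistentIndex(path)
    recovered = reopened.get("cat.jpg")
    missing = reopened.get("unknown.jpg")
    reopened.close()
```

Calling `fsync` on every update trades write throughput for durability; batching the sync is a common compromise.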
Conclusions
– Improved Application Performance
Done right, the application of such techniques improves data response speed significantly and is often part of the solution for long-running queries. In some cases we are able to improve the storage performance while avoiding a costly hardware upgrade.
– Data Schema Considerations
No storage solution can rely on these techniques alone. For an RDBMS data warehouse, a good schema is key to a responsive solution. Other areas to examine are bottlenecks on the data input and output interfaces (be it SCSI / SAS / IP).
– Why Avinton?
Avinton are by no means pioneers in this area – similar techniques are used by Google, Facebook, Yahoo and many other big players. We have simply mastered these techniques by using them over the years, starting with our early Telecom monitoring solutions, which are still in use today.
Final Thoughts
To design a good data storage solution the following are necessary:
– Know your data (HOT vs Cold – Structure, Size, Types etc.)
– Know your users (#Simultaneous Users, Types of queries)
– Detailed knowledge of HW (Server vendor specific HW options)
– Good working knowledge of the storage technique in use (be it DB or File based storage)
– Appropriate Storage Solution Selection (DAS / SAN / NAS)
Having a scalable Big Data Storage solution that allows you to leverage data insights efficiently is fundamental, since a large volume of data that is slow to retrieve loses much of its value.
Avinton have designed and delivered various data solutions including both RDBMS (PostgreSQL & Oracle) and hybrid RDBMS & file-based solutions on HDFS (Hadoop).
We offer an end-to-end service: Design > Dimensioning > Implementation > Deployment > SLA-based Support.
A common theme throughout this article is that Avinton’s solutions feature design considerations for improved IO performance at both the software and hardware level. This stems from our philosophy that to design high-performance big data solutions, one has to have a good understanding of the underlying hardware.
Our Research, Development and Testing work takes place at our development and training centre in Yokohama, where we test new hardware configurations and combine them with well-known big data solutions, as in our latest project with Spark on Hadoop. This allows us to bring our clients tailored solutions based on test result data.
Our Research, Development and Testing experience enables us to:
- Deliver optimised HW / SW platform combinations
- Reduce time to market
- Heavily tailor the solution to our client’s design requirements
- Provide SLA-based HW & Application support
We are passionate about data and welcome any enquiries in this regard.