Big data solutions are only as good as the technology they’re built on. That technology spans several key areas: the solution’s file system, its data store, data integration techniques, high performance computing requirements, and analytics & visualization. A wide variety of tools and technologies is available to address the needs in each of these areas, for example:
The Hadoop Distributed File System, or HDFS, is the heart of most Big Data solutions. It is a 100% open source file system that uses a highly optimized approach to storing, retrieving, replicating and processing huge volumes of data. What makes HDFS especially attractive is that it works with inexpensive, heterogeneous hardware. This makes it a highly scalable and reliable file system capable of delivering high performance, which is why it is so widely used in Big Data solutions.
In terms of the data store or database, there are a number of options available, since data can be kept both in traditional SQL-based database management systems and in NoSQL databases (for data that is diverse in format and changes frequently). The database options for Big Data solutions therefore span both types, for example:
Apache HBase – an open source, non-relational, distributed database modeled after Google’s BigTable and written in Java. It runs on top of HDFS, providing BigTable-like capabilities for Hadoop, i.e. a fault-tolerant way of storing large quantities of data. It features compression, in-memory operation, and Bloom filters on a per-column basis. Tables in HBase can serve as the input and output for MapReduce jobs run in Hadoop, and can be accessed through a native Java API as well as through REST, Avro or Thrift gateway APIs.
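To make the BigTable-style model concrete, here is a toy, in-memory sketch in plain Python (not the HBase API): cells are addressed by row key, a `family:qualifier` column name, and a timestamp, with the newest version returned by default. The table, rows and values are made up for illustration.

```python
import time

# Toy sketch of HBase's BigTable-style data model:
# row key -> {"family:qualifier": [(timestamp, value), ...]}, newest first.
# Real HBase persists this on HDFS; this only illustrates cell addressing.
table = {}

def put(row, column, value, ts=None):
    ts = ts if ts is not None else time.time()
    versions = table.setdefault(row, {}).setdefault(column, [])
    versions.insert(0, (ts, value))

def get(row, column):
    """Return the newest cell version, as HBase does by default."""
    versions = table.get(row, {}).get(column, [])
    return versions[0][1] if versions else None

put("user1", "info:name", "Alice", ts=1)
put("user1", "info:name", "Alicia", ts=2)  # newer version shadows the old one
print(get("user1", "info:name"))           # -> Alicia
```

Older versions are kept around rather than overwritten, which is how HBase supports time-travel reads up to a configured number of versions.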
Due to its impressive performance, it is currently in use by several data-driven websites including Facebook’s Messaging Platform.
Apache Cassandra – an open source distributed database management system designed to handle very large amounts of data spread out across many commodity servers, while providing a highly available service with no single point of failure. This is possible due to Cassandra’s ability to be distributed in multiple geographical locations, data centers or the cloud.
Cassandra provides a structured key-value store with tunable consistency. Keys map to multiple values, which are grouped into column families. The column families are fixed when a Cassandra database is created, but columns can be added to a family at any time. Furthermore, columns are added only for specified keys, so different keys can have different numbers of columns in any given family. The values from a column family for each key are stored together, which makes Cassandra a hybrid between a column-oriented DBMS and a row-oriented store.
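The key-to-column-family mapping described above can be sketched with plain Python dictionaries (the keys, column names and values below are invented for illustration): two keys in the same family can hold entirely different columns, and all of a key’s values for a family sit together.

```python
# Toy sketch of Cassandra's key -> column-family -> columns model
# (the classic "wide row" view). Columns exist per key, so different
# keys in one family can carry different columns.
column_family = {                      # one column family, e.g. "users"
    "alice": {"email": "a@example.com", "city": "Karachi"},
    "bob":   {"email": "b@example.com", "phone": "555-0101", "twitter": "@bob"},
}

def columns_for(key):
    """List the column names present for one key, sorted for readability."""
    return sorted(column_family.get(key, {}))

print(columns_for("alice"))  # -> ['city', 'email']
print(columns_for("bob"))    # -> ['email', 'phone', 'twitter']
```

In real Cassandra the same shape is declared and queried through CQL, and each key’s row is replicated across nodes according to the tunable consistency settings mentioned above.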
However, its data redundancy features are its most desirable ones and make it a popular choice in Big Data solutions.
Apache Hive – a data warehouse infrastructure built on top of Hadoop for providing data summarization, query, and analysis. Apache Hive supports analysis of large datasets stored in Hadoop-compatible file systems such as the Amazon S3 file system. It provides an SQL-like language called HiveQL and eases data access from Hadoop by compiling SQL statements into MapReduce jobs and running them across the Hadoop cluster. It also provides query acceleration capabilities via data indexing support (including support for bitmap indexes).
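To illustrate what “compiling SQL into MapReduce” means, here is a hypothetical HiveQL query (the `visits` table and `page` column are invented) alongside a single-process Python sketch of the map, shuffle and reduce phases Hive would generate for it:

```python
from collections import defaultdict

# A HiveQL GROUP BY query (table/columns are hypothetical) ...
hiveql = "SELECT page, COUNT(*) FROM visits GROUP BY page"

# ... and a sketch of the MapReduce job Hive compiles it into.
visits = [{"page": "/home"}, {"page": "/docs"}, {"page": "/home"}]

# map phase: emit a (page, 1) pair per row
mapped = [(row["page"], 1) for row in visits]

# shuffle + reduce: group pairs by key and sum the counts
counts = defaultdict(int)
for page, one in mapped:
    counts[page] += one

print(dict(counts))  # -> {'/home': 2, '/docs': 1}
```

On a real cluster the map and reduce steps run in parallel across many nodes; the point is that the analyst writes only the one-line HiveQL statement.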
MongoDB – an open source document-oriented database system. It is part of the NoSQL family of database systems and stores structured data as JSON-like documents with dynamic schemas (MongoDB calls the format BSON), making the integration of data in certain types of applications easier and faster. It is one of the most popular NoSQL database management systems and is used by Fortune 500 companies and media firms such as Foursquare, MTV Networks and Craigslist.
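The “dynamic schemas” point is the key difference from a relational table, and can be sketched without a running server (the collection and field names below are made up; with a live MongoDB you would insert the same dictionaries via pymongo’s `insert_one`):

```python
import json

# Sketch of MongoDB's dynamic schemas: documents in one collection
# need not share fields. A plain list stands in for the collection.
collection = []
collection.append({"name": "Alice", "tags": ["admin"]})
collection.append({"name": "Bob", "age": 30, "address": {"city": "Karachi"}})

# Documents round-trip as JSON (MongoDB stores BSON, a binary superset).
print(json.dumps(collection[1]["address"]))  # -> {"city": "Karachi"}

# A query simply skips documents that lack the queried field.
with_age = [d["name"] for d in collection if "age" in d]
print(with_age)  # -> ['Bob']
```

No ALTER TABLE is needed before adding `age` or the nested `address` document, which is what makes MongoDB convenient for diverse, frequently changing data.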
For High Performance Computing, one of the most popular technologies is MapReduce. MapReduce is a programming model for processing parallelizable problems across huge data-sets using a large number of computers (nodes), collectively referred to as a cluster (if all nodes are on the same local network and use similar hardware) or a grid (if the nodes are shared across geographically and administratively distributed systems, and use more heterogeneous hardware). Computational processing can occur on data stored either in a file system (unstructured) or in a database (structured). One of its key benefits is that it can take advantage of data locality, processing data on or near the storage assets to reduce data transmission.
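The model is easiest to see in the classic word-count example, sketched here as single-process Python. On a real cluster the map calls run on many nodes near the data and the framework shuffles each key’s pairs to a reducer; here the three phases simply run in sequence.

```python
from collections import defaultdict

def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one document."""
    return [(word, 1) for word in document.split()]

def shuffle(pairs):
    """Shuffle: group all emitted pairs by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reduce: sum each key's values into a final count."""
    return {word: sum(ones) for word, ones in groups.items()}

docs = ["big data big cluster", "big cluster"]
pairs = [p for d in docs for p in map_phase(d)]
print(reduce_phase(shuffle(pairs)))  # -> {'big': 3, 'data': 1, 'cluster': 2}
```

Because each `map_phase` call touches only one document, the work partitions naturally across however many nodes hold the data.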
For data integration, the most common tool used is Sqoop, software for transferring bulk data between relational databases and Hadoop. It works with both relational and NoSQL databases and can load individual tables or entire databases into and out of Hadoop, generating Java classes that provide an interface to the transferred data.
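A typical invocation looks like the following; the JDBC URL, table name and HDFS path are placeholders, and on a Hadoop edge node you would hand this argument list to `subprocess.run` (or type it directly at a shell).

```python
# Sketch of a typical `sqoop import` command, built as an argument list.
# Connection details below are placeholders, not a real database.
cmd = [
    "sqoop", "import",
    "--connect", "jdbc:mysql://dbhost/sales",  # placeholder JDBC URL
    "--table", "orders",                       # source table to copy
    "--target-dir", "/user/etl/orders",        # HDFS destination directory
    "-m", "4",                                 # run 4 parallel map tasks
]
print(" ".join(cmd))
```

Sqoop turns the import into parallel map tasks (four here), each pulling a slice of the table over JDBC into HDFS; `sqoop import-all-tables` does the same for a whole database.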
When it comes to Analytics & Visualization, the most widely adopted tool is the R programming language, an open source language and software environment for statistical computing and graphics. It offers a broad range of data manipulation, statistical modeling and charting capabilities and is widely used among statisticians and data miners for developing statistical software and performing data analysis.
R provides a wide variety of statistical and graphical techniques, including linear and nonlinear modeling, classical statistical tests, time-series analysis, classification, clustering, and others. It is highly extensible through user-submitted packages for specific functions or specific areas of study. Extending R is also eased by its lexical scoping rules.
Another strength of R is static graphics, which can produce publication-quality graphs, including mathematical symbols. Dynamic and interactive graphics are available through additional packages. R also supports procedural programming with functions and, for some functions, object-oriented programming with generic functions.
Another technology option available for Analytics is Pentaho Business Analytics, an open source solution that provides data integration, OLAP services, reporting, dashboards, data mining and ETL capabilities. The Pentaho suite comes in two editions, enterprise and community; the enterprise edition contains extra features not found in the community edition. However, Pentaho’s core offering can easily be extended with add-on products, usually in the form of plugins, from the company itself as well as the broader community of users and enthusiasts. That is what makes it so popular with Big Data solution architects.
With so many variables to consider, Big Data Solutions require extensive knowledge and significant expertise to develop and implement. At Folio3, we offer a wide variety of big data services. We have deployed big data solutions at a number of our clients and can offer a solution tailored to your specific needs. For more information, please visit our Business Intelligence Services page.