Toward Scalable Systems for Big Data Analytics
Recent technological advancements have led to a deluge of data from distinct domains (e.g., health care, scientific sensors, user-generated content, Internet and financial companies, and supply-chain systems) over the past two decades. The term big data was coined to capture the meaning of this emerging trend. In addition to its sheer volume, big data also exhibits other unique characteristics compared with traditional data. For instance, big data is commonly unstructured and requires more real-time analysis. This development calls for new system architectures for data acquisition, transmission, storage, and large-scale data processing. In this paper, we present a literature survey and system tutorial for big data analytics platforms, aiming to provide an overall picture for nonexpert readers and to instill a do-it-yourself spirit in advanced audiences to customize their own big-data solutions. First, we present the definition of big data and discuss big data challenges. Next, we present a systematic framework to decompose big data systems into four sequential modules, namely data generation, data acquisition, data storage, and data analytics. These four modules form a big data value chain. Following that, we present a detailed survey of numerous approaches and mechanisms from the research and industry communities. In addition, we present the prevalent Hadoop framework for addressing big data challenges. Finally, we outline several evaluation benchmarks and potential research directions for big data systems.
The emerging big-data paradigm, owing to its broad impact, has profoundly transformed our society and will continue to attract diverse attention from both technological experts and the general public. It is obvious that we are living in a data-deluge era, evidenced by the sheer volume of data from a variety of sources and its growing rate of generation. For instance, an IDC report predicts that, from 2005 to 2020, the global data volume will grow by a factor of 300, from 130 exabytes to 40,000 exabytes, doubling every two years. The term ``big data'' was coined to capture the profound meaning of this data-explosion trend; indeed, data has been touted as the new oil and is expected to transform our society. For example, a McKinsey report states that the potential value of global personal
location data is estimated to be $100 billion in revenue to service providers over the next ten years and as much as $700 billion in value to consumer and business end users. The huge potential associated with big data has led to an emerging research field that has quickly attracted tremendous interest from diverse sectors, such as industry, government, and the research community. The broad interest is first exemplified by coverage in both industrial reports and the public media, e.g., the Economist, the New York Times, and National Public Radio. Governments have also played a major role, creating new programs to accelerate progress in tackling big-data challenges. Finally, Nature and Science have published special issues discussing the big-data phenomenon and its challenges, expanding its impact beyond technological domains. As a result, this growing interest in big data from diverse domains demands a clear and intuitive understanding of its definition, evolutionary history, building technologies, and potential challenges. This tutorial paper focuses on scalable big-data systems, which include a set of tools and mechanisms to load, extract, and improve disparate data while leveraging massively parallel processing power to perform complex transformations and analysis. Owing to the uniqueness of big data, designing a scalable big-data system faces a series of technical challenges, including:
• First, due to the variety of disparate data sources and the sheer volume of data, it is difficult to collect and integrate data scalably from distributed locations. For instance, more than 175 million tweets containing text, images, video, and social relationships are generated by millions of accounts distributed globally.
• Second, big data systems need to store and manage the gathered massive and heterogeneous datasets while providing functional and performance guarantees in terms of fast retrieval, scalability, and privacy protection. For example, Facebook needs to store, access, and analyze over 30 petabytes of user-generated data.
• Third, big data analytics must effectively mine massive datasets at different levels in real time or near real time – including modeling, visualization, prediction, and optimization – such that their inherent promise can be revealed to improve decision making and acquire further advantages.
These technological challenges demand a thorough re-examination of current data management systems, from their architectural principles down to their implementation details. Indeed, many leading industry companies have discarded traditional solutions to embrace emerging big data platforms. Traditional data management and analysis systems, mainly based on the relational database management system (RDBMS), are inadequate for tackling the aforementioned big-data challenges. Specifically, the mismatch between traditional RDBMSs and the emerging big-data paradigm falls into the following two aspects:
• From the perspective of data structure, RDBMSs can only support structured data and offer little support for semi-structured or unstructured data.
• From the perspective of scalability, RDBMSs scale up with expensive hardware and cannot scale out with commodity hardware in parallel, which makes them unsuitable for coping with the ever-growing data volume. To address these challenges, the research community and industry have proposed various solutions for big data systems in an ad-hoc manner. Cloud computing can be deployed as the infrastructure layer for big data systems to meet certain infrastructure requirements, such as cost-effectiveness, elasticity, and the ability to scale up or down. Distributed file systems and NoSQL databases are suitable for the persistent storage and management of massive schema-free datasets. MapReduce, a programming framework, has achieved great success in processing group-aggregation tasks, such as website ranking. Hadoop integrates data storage, data processing, system management, and other modules to form a powerful system-level solution, which is becoming the mainstay in handling big data challenges. We can construct various big data applications based on these innovative technologies and platforms. In light of the proliferation of big-data technologies, a systematic framework is in order to capture the fast evolution of big-data research and development efforts and to put the developments on different frontiers in perspective.
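To make the group-aggregation pattern concrete, the following is a minimal in-memory sketch of the map and reduce phases; the function names and the toy single-process runner are our own simplification for illustration, not Hadoop's actual API, which distributes the shuffle-and-sort step across a cluster. Here the aggregation counts incoming links per site, a basic building block of website ranking:

```python
from collections import defaultdict

# Map phase: emit (key, value) pairs -- one count per outgoing link.
def map_links(page, links):
    for link in links:
        yield (link, 1)

# Reduce phase: aggregate all values that share the same key.
def reduce_counts(link, counts):
    return (link, sum(counts))

def run_mapreduce(pages):
    """Toy stand-in for the shuffle/sort step of a real MapReduce
    engine; a framework such as Hadoop would partition the groups
    across many nodes instead of one dictionary."""
    groups = defaultdict(list)
    for page, links in pages.items():
        for key, value in map_links(page, links):
            groups[key].append(value)
    return dict(reduce_counts(k, v) for k, v in groups.items())

pages = {"a.com": ["b.com", "c.com"], "b.com": ["c.com"]}
print(run_mapreduce(pages))  # {'b.com': 1, 'c.com': 2}
```

Because the map and reduce functions are side-effect free and operate on independent key groups, the same logic parallelizes naturally, which is precisely why the paradigm scales out on commodity hardware where RDBMSs cannot.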
In this article, drawing on our first-hand experience of building a big-data solution on our private modular data center testbed, we strive to offer a systematic tutorial for scalable big-data systems, focusing on the enabling technologies and architectural principles. It is our humble expectation that the paper can serve as a first stop for domain experts, big-data users, and the general audience looking for information and guidance on their specific big-data needs. For example, domain experts could follow our guidelines to develop their own big-data platforms and conduct research in the big-data domain; big-data users can use our framework to evaluate alternative solutions proposed by their vendors; and the general audience can understand the basics of big data and its impact on their work and life. To this end, we first present a list of alternative definitions of big data, supplemented with the history of big data and big-data paradigms. Following that, we introduce a generic framework to decompose big data platforms into four components, i.e., data generation, data acquisition, data storage, and data analysis. For each stage, we survey current research and development efforts and provide engineering insights for architectural design. Moving toward a specific solution, we then delve into Hadoop – the de facto choice of big data analysis platform – and provide benchmark results for big-data platforms.
- J. Gantz and D. Reinsel, ‘‘The digital universe in 2020: Big data, bigger digital shadows, and biggest growth in the far east,’’ in Proc. IDC iView, IDC Anal. Future, 2012.
- J. Manyika et al., Big Data: The Next Frontier for Innovation, Competition, and Productivity. San Francisco, CA, USA: McKinsey Global Institute, 2011, pp. 1–137.
- K. Cukier, ‘‘Data, data everywhere,’’ Economist, vol. 394, no. 8671, pp. 3–16, 2010.
- The Economist. (2011, Nov.). Drowning in Numbers—Digital Data Will Flood the Planet and Help Us Understand It Better [Online]. Available: http://www.economist.com/blogs/dailychart/2011/11/bigdata-0
- S. Lohr. (2012). The Age of Big Data. New York Times [Online]. Available: http://www.nytimes.com/2012/02/12/sunday-review/big-datas-impact-in-the-world.html?pagewanted=all&r=0