Mr Doug Cutting, Chief Architect of Cloudera, shares his views on Hadoop's legacy and future.
The elephant has turned 10, and the proud father is happy to talk about its achievements thus far.
Hadoop’s main legacy, says Mr Doug Cutting, is not so much the technology as the style of innovation that it engenders.
Mr Cutting is the founder of the open source big data project named after his then five-year-old son’s plush toy – a yellow elephant.
He was sharing his experiences with Hadoop in the last installment of the Infocomm Development Authority of Singapore’s Distinguished Speaker Series, which was held at the Shangri-La Hotel on 11 April. IDA's Assistant Chief Executive, Mr Khoong Hock Yun, gave the opening remarks and introduced Mr Cutting to the audience.
In his address, Mr Cutting noted that as an open source platform, Hadoop broke with decades-old methods of software development where technologies were controlled by vendors and standards were slow to evolve.
“When people try to cling closely to controlling things, they limit what those things can become,” he observed.
With open source, it is the users who shape the development of technology. He cited the examples of cluster computing framework Apache Spark and message broker Apache Kafka, which emerged as “random mutations” in an evolutionary model where survival and success were determined not by vendors but by the users.
This presents a much more rapid and responsive approach to software development, he said. Developers can directly download the code without having to sign any agreement or contract, and barriers to adoption are much lower.
They can perform their evaluations and if they are confused they can look directly into the code. They can start using it and if they encounter problems and are able to resolve them, they can contribute back to the open source project. “It is a distributed ecosystem, with truly distributed control over the technology stack itself.”
Mr Cutting made his first foray into open source in the late 1990s with a text search engine called Lucene, named after his wife’s middle name. He was working for an Internet company and developed Lucene in his spare time as an “insurance policy” against the Internet bubble popping.
And pop it did.
The company he was working for went bankrupt, and Mr Cutting had to decide what he wanted to do next. He had previously licensed Lucene but did not enjoy the experience of negotiating terms and dealing with lawyers.
So he decided, as an experiment, to release it as open source. His fundamental need, he said, was “to see that it was used”.
“I like to write software that people can get value from.”
How Hadoop Happened
Years later, an opportunity arose to work on a web search engine that could operate on a much more massive scale. The project drew on techniques described by Google in its publications on the Google File System and MapReduce.
Google File System is designed to provide efficient, reliable access to data using large clusters of commodity hardware while MapReduce is a programming model that allows for massive scalability across hundreds or thousands of servers in a cluster.
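To illustrate the MapReduce programming model described above, here is a minimal single-machine sketch in Python that counts words. It is not Hadoop code: the function names (map_phase, shuffle, reduce_phase, word_count) and the sample documents are assumptions made for illustration. In a real cluster, the map and reduce steps would run in parallel on many servers and the framework would handle the grouping in between.

```python
from collections import defaultdict


def map_phase(document):
    """Map step: emit a (word, 1) pair for every word in one document."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle(pairs):
    """Group emitted values by key, as a framework would between the two steps."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce step: combine all values seen for one key into a single total."""
    return key, sum(values)


def word_count(documents):
    """Run map, shuffle and reduce in sequence on one machine (illustrative only)."""
    pairs = []
    for document in documents:
        pairs.extend(map_phase(document))
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())


# Hypothetical sample input; each string stands in for one document.
counts = word_count(["big data big clusters", "data on commodity clusters"])
print(counts)
```

Because each document is mapped independently and each word is reduced independently, the work can in principle be spread across as many machines as there are documents or distinct words, which is what makes the model scale.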
Further development of the project under Yahoo got the web search engine to work reliably on thousands of computers. Mr Cutting named it Hadoop, and Silicon Valley companies like Facebook, LinkedIn and Twitter started using it.
“This was the end point I had imagined,” said Mr Cutting.
By 2007, Hadoop was an established project that was going to survive on its own.
The following year, another tantalising challenge presented itself – and that was to bring Hadoop beyond its Silicon Valley adopters to mainstream enterprises. That was when he joined Cloudera as its Chief Architect, to help fill the gap between what Hadoop was and what enterprises needed.
Over the years, a lot has been added to Hadoop. For instance, it now has a good security stack with authentication and finer-grained authorisation and encryption. But what is more interesting, said Mr Cutting, are the projects that have been added around Hadoop such as the data warehouse infrastructure Apache Hive and the distributed non-relational database HBase, amongst many others.
With multiple tools available on a single platform, Hadoop enables a different style of operation. Developers no longer need to spend time designing applications before they start building anything. They can immediately load data and explore its value whilst interactively building their initial application, testing it and improving it over time.
“This new platform is designed to embrace that style of operation, so that businesses can move more efficiently and effectively with their software projects.”
Hadoop adoption has moved just as efficiently and effectively, taking off at a tremendous rate across industries. And this is because “every company is becoming a digital company driven by data”.
Banks use it to better understand their customers, reduce fraud and evaluate applications for loans. Telecommunications companies use it to optimise their network by finding out why calls are being dropped and where to add more cell towers.
It also helps them understand why users switch from one service provider to another. “Hadoop allows them to combine more data sources and work on problems on a much larger scale than they could before,” said Mr Cutting.
With its Smart Nation vision, Singapore is also positioning itself to take advantage of data on an unprecedented scale. Data from the Internet of Things, from connected cars, mobile phones, cameras and other sensors, can provide a high-resolution image of people’s activities and enable public and private sector organisations to improve their services and provide better value to citizens and customers.
“Singapore is leading the world in many ways, for example, in the rate of adoption of new technologies,” said Mr Cutting. “The open source ethic is something that works well here because people here have a good notion of public value and behaving responsibly.”
Reiterating a call that he often makes, he said, “We have to think about the ethics of data systems. Privacy is a real concern. As an IT industry, we pay too little attention to it. We assume that technology is all that we care about when in fact we need to think about how the technology is used.”
“The government has a huge role to play by coming up with sensible regulation that allows technology to achieve great things, allows us to grow while behaving responsibly and ethically and respecting individuals,” he added.
Another challenge involves people.
The era of big data will require a big change in mindsets: “People need to learn new skills, and learn new skills regularly. No longer can you have a particular skillset and use that for your entire career.”
He noted that lifelong learning is a direction promoted by the Singapore government as well, in its efforts to develop a smarter workforce.
“Because Singapore has no natural resources except for people who get by on their wits – that is the strength of Singapore and investing in people is the smartest thing that you can be doing.”
What is Hadoop?
Hadoop is an open-source framework that enables large data sets to be processed across a distributed computing environment comprising clusters of commodity hardware. It is supported by the non-profit Apache Software Foundation, which sponsors open source projects.