Massive, unstructured data repositories offer rich material for predictive analytics. They can also cause a serious drain on overall system capabilities.
GLENDALE, CA, October 15, 2018 – According to a report from the Aberdeen Group, the average company is experiencing data volume growth of more than 50% per year, from an average of 33 different sources. Primary reasons cited for this are to increase operational efficiency; make data available from departmental silos and legacy systems; lower transactional costs; and offload capacity from mainframes or data warehouse operations [1]. “The use of what are called ‘data lakes’,” says James D’Arezzo, CEO, Condusiv Technologies, “is an increasing contributor to this staggering growth rate.” D’Arezzo, whose company is a world leader in I/O reduction and SQL database performance, adds, “For organizations to make productive use of data in these volumes, it is vital that their IT managers take steps to optimize basic system functions.”
The term “data lake” is comparatively new, and there is still some confusion between a data lake and a data warehouse. The primary difference is that information placed into a data warehouse must first be structured into folders, rows, and columns, whereas a data lake is a repository for all kinds of data, structured or unstructured. Structure is applied only when the data is queried by a user [2].
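The idea of applying structure only at query time can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the sample records, the field names (`customer`, `amount`), and the helper functions are hypothetical and do not come from any of the cited reports or from a particular data lake product.

```python
import json

# Raw records land in the "lake" exactly as received: mixed formats, no shared schema.
# (Hypothetical sample data, for illustration only.)
raw_lake = [
    '{"customer": "A17", "amount": "19.99", "ts": "2018-10-01"}',  # JSON event
    'B42,5.00,2018-10-02',                                          # CSV line
    'free-text note with no recoverable fields',                    # unstructured text
]


def read_with_schema(record):
    """Apply a (customer, amount) schema at query time; return None if it does not fit."""
    # First, try to interpret the record as a JSON object.
    try:
        obj = json.loads(record)
        return {"customer": obj["customer"], "amount": float(obj["amount"])}
    except (ValueError, KeyError, TypeError):
        pass
    # Next, try to interpret it as a three-field CSV line.
    parts = record.split(",")
    if len(parts) == 3:
        try:
            return {"customer": parts[0], "amount": float(parts[1])}
        except ValueError:
            return None
    # Anything else stays in the lake but is skipped by this particular query.
    return None


def query_total_by_customer(lake):
    """Sum amounts per customer, imposing structure only while reading."""
    totals = {}
    for record in lake:
        row = read_with_schema(record)
        if row is None:
            continue
        totals[row["customer"]] = totals.get(row["customer"], 0.0) + row["amount"]
    return totals


totals = query_total_by_customer(raw_lake)
print(totals)
```

In this sketch nothing is rejected or reshaped when data is stored. Each query pays the parsing cost on every read, which is one way to picture why data lake workloads can be resource intensive.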
Many data experts see the use of data lakes as a vital next step in making strategic use of information. “In today’s world,” says Michael Hiskey, Head of Strategy at hub software development firm Semarchy, “a data lake is the foundation of information management. When built successfully, it can empower all end-users, even nontechnical ones, to use data and unlock its power. In a word, the data lake makes data science possible.” [3]
For all its potential power, however, the great strength of a data lake (its ability to absorb data of virtually any kind from virtually any source) is also a weakness. Its constituent pieces of differently structured data must undergo considerable processing and preparation before they can be combined and analyzed to produce meaningful insight, and that preparation requires significant system resources. Suppose, explains an industry observer, you’re running a job on Hadoop. Running a machine learning engine could take up quite a few CPU cycles. Real-time analytics, on the other hand, could be extremely memory intensive. Transforming or prepping data for analytics might be equally I/O intensive [4].
Meanwhile, notes D’Arezzo, the need for breadth (again, the data lake’s reason for existence) must co-exist with the need for speed. In a world in which the term “big data” is rapidly being replaced by “fast data” [5], all organizations are struggling to get the most out of their necessarily limited computational resources. If those resource limits are left unaddressed, system performance will inevitably degrade as data lakes expand.
“The temptation,” D’Arezzo says, “will be to throw money at the problem in the form of additional hardware. But that won’t work, partly because it’s inefficient, and partly because data volume is growing a lot faster than IT budgets. Both financially and in terms of overall system performance, it makes better sense to optimize the capacity of the hardware you already have. We’ve developed software solutions that can improve overall system throughput by 30% to 50%, or more, without the need for new hardware.”
About Condusiv Technologies
Condusiv® Technologies is the world leader in software-only storage performance solutions for virtual and physical server environments, enabling systems to process more data in less time for faster application performance. Condusiv guarantees to solve the toughest application performance challenges with faster-than-new performance via V-locity® for virtual servers or Diskeeper® for physical servers and PCs. With over 100 million licenses sold, Condusiv solutions are used by 90% of the Fortune 1000 and almost three-quarters of the Forbes Global 100 to increase business productivity and reduce data center costs while extending the life of existing hardware. Condusiv Chief Executive Officer Jim D’Arezzo has had a long and distinguished career in high technology.
Condusiv was founded in 1981 by Craig Jensen as Executive Software. Jensen authored Diskeeper, which became the best-selling defragmentation software of all time. Over 37 years, he has taken the thought leadership in file system management and caching and transformed it into enterprise software. For more information, visit https://condusiv.com.
- 1. Lock, Michael, “Angling for Insight in Today’s Data Lake,” Aberdeen Group, October 2017.
- 2. Patrizio, Andy, “What is a data lake? Flexible big data management explained,” InfoWorld, September 24, 2018.
- 3. Hiskey, Michael, “Building a Successful Data Lake: An Information Strategy Foundation,” Data Center Knowledge, September 11, 2018.
- 4. Kleyman, Bill, “A Deep Dive Into Data Lakes,” Data Center Frontier, August 29, 2018.
- 5. Kolsky, Esteban, “What to do with the data? The evolution of data platforms in a post big data world,” ZDNet, September 13, 2018.