Dell Statistica Takes the Analytic Man to the Data Mountain

Challenges with Analyzing Data from a Distance

Throughout my career in data analysis, analytic processing speed has been a continuing challenge. Every new project seemed to have some wrinkle that affected speed, with contributing factors including table size, join count and computational complexity. During my stint supporting the financial industry, stochastic modeling required row-level calculations that resulted in multiple table scans; example calculations were sums of percent-weighted variables and variance-covariance (var-covar) matrices. In many cases, I hand-scripted stored procedures to compute the analytics close to the data so that performance was acceptable.
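To make the row-level workload concrete, here is a minimal sketch of the two example calculations, with NumPy standing in for a stored procedure and entirely hypothetical numbers; the point is that both require touching every row, which is why computing them close to the data pays off:

```python
import numpy as np

# Hypothetical observations: rows are time points, columns are instruments.
returns = np.array([
    [ 0.01,  0.02, -0.01],
    [ 0.00,  0.01,  0.03],
    [-0.02,  0.00,  0.01],
    [ 0.01, -0.01,  0.02],
])
weights = np.array([0.5, 0.3, 0.2])  # percent weights, summing to 1

# Sum of percent-weighted variables: one weighted value per row,
# i.e. a full scan of the table.
weighted = returns @ weights

# Var-covar matrix across instruments: another pass over every row.
var_covar = np.cov(returns, rowvar=False)

print(weighted)          # one value per observation
print(var_covar.shape)   # (3, 3), symmetric
```

Done in a database round trip per row, these scans are slow; done in one pass next to the data, they are cheap.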

Fast-forward to IoT analytics. If we apply enterprise data analysis practices, we might first transport raw data from edge devices to a central database, then run the analytics against the database. The advantages would be centralized control over analytics and the ability to do ad-hoc analysis. However, this approach may break down for predictive analytics, because the turnaround time between a precursor event and the prediction of a subsequent, more important event might not be sufficiently fast or guaranteed. In particular, communication can be hampered by limited bandwidth, spotty up-time and delays from fixed transmission windows.

Because of these data communication issues, I predict that edge- or fog-deployed (fog ~ local area network) predictive analytics will be common for IoT. This is not so different from using stored procedures within databases to accelerate analytic computation. In particular, the likely combination of device diversity, complex event profiles, potentially high data volumes and problematic data transmission will make ‘predicting from afar’ slow and ineffective. Elastic scale-out of cloud computing resources could overcome some of the latency, but even then network bandwidth constraints may make getting high-volume data to the cloud expensive and slow.
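The edge-deployment idea can be sketched in a few lines. Assume a model has already been trained centrally and its coefficients pushed to the device; the device then scores readings locally and transmits only high-probability events rather than the raw stream. The feature names, coefficients and threshold below are all invented for illustration:

```python
import math

# Hypothetical pre-trained logistic model, pushed to the edge device.
COEFFS = [0.8, -1.2]   # assumed weights for [vibration_rms, temp_delta]
INTERCEPT = -0.5

def score(features):
    """Score one sensor reading locally (logistic regression)."""
    z = INTERCEPT + sum(c * x for c, x in zip(COEFFS, features))
    return 1.0 / (1.0 + math.exp(-z))

def should_transmit(features, threshold=0.7):
    # Only likely precursor events cross the network, sidestepping
    # bandwidth, up-time and transmission-window constraints.
    return score(features) >= threshold
```

Scoring stays on-device and is deterministic in latency; the network is touched only when something worth predicting from actually happens.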

Dell Introduces Native Distributed Analytics for Statistica

Recently Dell briefed me on a future enhancement for their Statistica advanced analytics product called Native Distributed Analytics or NDA. The big idea for NDA is to execute algorithm run-times for classification, clustering and so on, close to the data to accelerate processing. This will be especially important for predictive model training because of the row-level computation challenges I cite above. In other words, they’re bringing the analytic man to the data mountain.

Dell’s enhancement roadmap for NDA involves two deployment phases:

For phase one (see figure 1 below), Dell will integrate Statistica with Boomi, their data integration cloud service. This will allow Statistica models to be packaged for JVM-based execution and then pushed to various cloud, fog, and edge run-time environments. The modules will likely be too large to run on device controllers like Arduino or Samsung’s Artik 1, which have 32KB Flash memory and 1MB RAM, respectively. On the other hand, the JVMs will likely be small enough to run on edge device architectures like the Raspberry Pi 2 Model B, Intel Edison and Samsung’s Artik 5, which have dual or quad core processors and 500MB RAM or more. Other likely destinations for Statistica JVMs are on-site servers (e.g. at a wind power farm), such as those offered by Dell. SalesForce.com and Teradata databases are also planned run-time destinations.

Figure 1. Dell Native Distributed Architecture – Phase One

For phase two, Dell will port their Statistica Model Building Environment (SMBE) to multiple platforms including Linux. System requirements will be similar to those of most client development or server runtime environments for enterprise applications, namely more RAM and faster multi-core processors.

Conclusions

Deployment of predictive analytics to edge devices is within reach of any data scientist today who has the time, budget and willingness to code in Python. But a democratization wave is coming in which technologies like Dell’s NDA (and also Predixion Software’s Deployment Assistant) will take the cost and time out of predictive model deployment. Further, they will allow more refinement iterations after the first version ships.

IoT analytics system developers will still need to assess the pros and cons of edge/fog deployment of predictive models. If real-time or near-real-time response is a requirement, or if data volume is prohibitively large (e.g. vibration signal processing), then local execution of analytics will likely be a core architectural attribute.

Eric Rogge

Eric Rogge is an experienced technology professional with 30+ years with enterprise, business intelligence and data acquisition software and hardware. His unique combination of R&D, marketing and consulting experience provides...
