By Amrita Prasad (ZHAW)
It’s already been a month since we met as the Swiss Data Science community at our 3rd Swiss Conference on Data Science (SDS|2016), pushed again by ZHAW’s Datalab group and presented by SAP Switzerland.
Several additional organisations sponsored and supported the conference – the organising committee thanks IT Logix & Microsoft, PwC, Google, Zühlke, SGAICO, Hasler Stiftung and the Swiss Alliance for Data-Intensive Services for their support in bringing together a successful event!
By Thilo Stadelmann (ZHAW)
Reposted from https://dublin.zhaw.ch/~stdm/?p=350#more-350
I recently came across the notion of “type A” and “type B” data scientists. While “type A” is basically a trained statistician who has broadened his field towards modern use cases (“data science for people”), the same is true for “type B” (B for “build”, “data science for software”), who has his roots in programming and contributes more strongly to code and systems in the backend.
Frankly, I haven’t come across a practically more useless distinction since the inception of the term “data science”. Data science is the name for a new discipline that is in itself interdisciplinary [see e.g. here – but beware of German text]. The whole point of interdisciplinarity, and by extension of data science, is for a proponent to think outside the box of his or her original discipline (which might be statistics, computer science, physics, economics or something completely different), and to acquire skills in the neighboring disciplines in order to tackle problems outside of intellectual silos. Encouraging practitioners to stay in their silos, as this A/B typology suggests, is counterproductive at best, fatal at worst.
By Oliver Dürr (ZHAW)
Reposted from http://oduerr.github.io/blog/2016/04/06/Deep-Learning_for_lazybones
In this blog post I explore the possibility of using a CNN trained on one image dataset (ILSVRC) as a feature extractor for another image dataset (CIFAR-10). The code, which uses TensorFlow, can be found on GitHub.
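To make the idea concrete without the real network, here is a minimal toy sketch in plain Python. It is not the TensorFlow code from the post: the three fixed filters are a hypothetical stand-in for a pretrained CNN, the four-pixel "images" are made up, and the nearest-centroid classifier is just one simple choice for the model trained on top of the frozen features.

```python
# Hypothetical stand-in for a CNN pretrained on a source dataset: the weights
# are fixed and are never updated while working with the target dataset.
PRETRAINED_FILTERS = [
    [1.0, 1.0, -1.0, -1.0],   # responds to a bright left half
    [-1.0, -1.0, 1.0, 1.0],   # responds to a bright right half
    [1.0, -1.0, 1.0, -1.0],   # responds to an alternating pattern
]


def extract_features(image):
    """Apply the frozen filters followed by a ReLU."""
    return [
        max(0.0, sum(w * p for w, p in zip(filt, image)))
        for filt in PRETRAINED_FILTERS
    ]


def fit_centroids(images, labels):
    """Train a nearest-centroid classifier on the extracted features."""
    sums, counts = {}, {}
    for image, label in zip(images, labels):
        feats = extract_features(image)
        acc = sums.setdefault(label, [0.0] * len(feats))
        for i, value in enumerate(feats):
            acc[i] += value
        counts[label] = counts.get(label, 0) + 1
    return {label: [v / counts[label] for v in acc] for label, acc in sums.items()}


def predict(centroids, image):
    """Return the label whose feature centroid is closest to the image's features."""
    feats = extract_features(image)

    def squared_distance(centroid):
        return sum((a - b) ** 2 for a, b in zip(feats, centroid))

    return min(centroids, key=lambda label: squared_distance(centroids[label]))


# Made-up target dataset: four-pixel "images", bright on the left or on the right.
train_images = [
    [0.9, 0.8, 0.1, 0.2],
    [1.0, 0.7, 0.0, 0.1],
    [0.1, 0.2, 0.9, 0.8],
    [0.0, 0.1, 0.7, 1.0],
]
train_labels = ["left", "left", "right", "right"]

centroids = fit_centroids(train_images, train_labels)
print(predict(centroids, [0.8, 0.9, 0.2, 0.0]))
print(predict(centroids, [0.2, 0.1, 1.0, 0.9]))
```

Only the small classifier is fitted to the new data; the extractor stays untouched, which is what keeps this approach cheap.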
The Swiss Data Science community recently met at SDS|2015, the 2nd Swiss Workshop on Data Science. It was a full-day event organized by ZHAW Datalab, with inspiring talks, hands-on data expeditions, and an excellent provision of space and atmosphere for fruitful networking. The conference took place on the 12th of June at the premises of ZHAW in Winterthur. It attracted people with a wide range of skills, expertise, and levels from doers to managers, and had very strong support from industry, hence showing the huge potential and scope of the subject.
After the President of ZHAW had kicked off the workshop, Dr. Jean Egbert Sturm, Professor of Microeconomics & Director of KOF Swiss Economic Institute at ETH, gave an insightful keynote talk on “The use of ever increasing datasets in Macroeconomic forecasting”. He explained to the audience how to do economic forecasting using simple and standard analytical techniques. It was particularly interesting for data analytics experts to see a methodology that successfully combines down-to-earth analytical techniques with in-depth knowledge of economics.
In this article, I recount measures and approaches used to deal with a relatively small data set that, in turn, has to be covered “perfectly”. In current academic research and large-scale industrial applications, datasets contain millions to billions (or even more) of documents. While this burdens implementers with considerations of scale at the level of infrastructure, it may make matching comparatively easy: if users are content with a few high-quality results, good retrieval effectiveness is simple to attain. Larger datasets are more likely to contain any requested information, linguistically encoded in many different ways, i.e., using different spellings, sentences, grammar, languages, etc.: a “blind shot” will hit one of many targets.
However, there are business domains whose main entities’ count will never reach the billions, which inherently limits the document pool. We have recently added another successfully finished CTI-funded project to our track record, which operated in such a business domain. Dubbed “Stiftungsregister 2.0”, the aim of the project was to create an application which enables users to search for all foundations in existence in Switzerland.
As part of a recent project called Big Data Query Processing we have evaluated complex query workloads using modern Big Data systems. In particular, we have performed benchmarks of Cloudera Impala using a business intelligence use case provided by an industry partner. The results can be found on the following blog post hosted by Cloudera:
In this post, our new Datalab members Kurt Pärli and Anita Zimmermann from ZHAW’s Zurich Center for Privacy and Dataprotection comment on the recent judgment of the European court against Google; see also
SDS|2014, the 1st Swiss Workshop on Data Science, took place on the 21st of March, 2014 – and we organized it. You can find an excellent summary of the talks on Frank van Lingen’s blog “ITelligence Insight” (Frank attended the workshop, but is not affiliated with us – so it ought to be a fair review), and the slides are available under the first link above.
So instead of repeating Frank here, I want to do two things: give you some impressions of the day itself, and draw some conclusions:
I’m glad that Thilo mentioned Security & Privacy as part of the data science skill set in his recent blog post. In my opinion, the two most interesting questions with respect to security & privacy in data science are the following:
- Data science for security: How can data science be used to make security-relevant statements, e.g. predicting possible large scale cyber attacks based on analysing communication patterns?
- Privacy for data science: How can data that contains personally identifiable information (PII) be anonymized before it is provided to the data scientists for analysis, such that the analyst cannot link data back to individuals? This is typically referred to as data anonymization.
This post deals with the second question. I’ll first show why obvious approaches to anonymizing data typically don’t offer true anonymity and will then introduce two approaches that provide better protection.
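As a small illustration of why an obvious approach can fail, the sketch below assumes that "obvious" means deleting names while keeping attributes such as postal code, birth year and sex. All records, field names and the `link` helper are made up for this example and are not taken from the post; the point is only that such quasi-identifiers can be joined against a public list.

```python
# Hypothetical "anonymized" release: names removed, quasi-identifiers kept.
released = [
    {"zip": "8400", "birth_year": 1975, "sex": "F", "diagnosis": "asthma"},
    {"zip": "8400", "birth_year": 1982, "sex": "M", "diagnosis": "diabetes"},
    {"zip": "8001", "birth_year": 1975, "sex": "F", "diagnosis": "migraine"},
]

# Hypothetical public list (e.g. a register) that still carries names.
public = [
    {"name": "Alice Example", "zip": "8400", "birth_year": 1975, "sex": "F"},
    {"name": "Bob Example", "zip": "8400", "birth_year": 1982, "sex": "M"},
    {"name": "Carol Example", "zip": "8001", "birth_year": 1990, "sex": "F"},
]

QUASI_IDENTIFIERS = ("zip", "birth_year", "sex")


def link(released_records, public_records):
    """Re-identify released records whose quasi-identifiers match exactly one person."""
    index = {}
    for person in public_records:
        key = tuple(person[q] for q in QUASI_IDENTIFIERS)
        index.setdefault(key, []).append(person["name"])

    hits = []
    for record in released_records:
        key = tuple(record[q] for q in QUASI_IDENTIFIERS)
        names = index.get(key, [])
        if len(names) == 1:  # a unique match links the record to one individual
            hits.append((names[0], record["diagnosis"]))
    return hits


reidentified = link(released, public)
for name, diagnosis in reidentified:
    print(name, "->", diagnosis)
```

In this toy data, two of the three released records are linked back to a named person although no name was ever released.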
Drew Conway’s data science Venn diagram is used by many (including me) to give a first impression of what data science is all about. And rightly so: I, for example, like it for its simplicity and “coolness”.
In more in-depth discussions that move from mere buzz to concrete skills and project possibilities, we at the Datalab have had good experiences with the following “skill set map”: