Trade off accuracy against efficiency?

tags: r, tex

20 Apr 2018

**Part 1: “Trade off accuracy against efficiency?”**

Abstract - Data, that is “facts and statistics collected together for reference or analysis” [2], is generated everywhere, all the time, and at such a pace that it adds up to enormous volumes. When the volume of data grows beyond what ‘a normal computer’ can process or analyse, we sometimes refer to it as Big Data.

Why would one want to work with Big Data, or even collect it in the first place? The reason is that people tend to think it is valuable. As data comes in all shapes and sizes, it is not always straightforward to see what value it holds, but analysis can reveal correlations and preferences that are useful to know, for example in business applications.

This search for information is called Data Mining (one of several names for the same activity within Data Science). Data mining can become time-consuming as the data grows. To describe how long a computer program needs to run, computer scientists use asymptotic notation (better known as “Big O” notation), which expresses “how the running time of an algorithm increases with the size of the input in the limit, as the size of the input increases without bound” [3]. As the data becomes large and the code that processes it becomes more complex, processing Big Data can result in an unworkable running time. This essay focuses on approaches that can find the desired information in a reasonable time, as discussed in the lectures and various papers.
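As a small illustration of how a running time can become unworkable, the sketch below assumes a brute-force search that would examine every non-empty itemset over n distinct items, of which there are 2^n - 1. The function name `candidate_itemsets` is hypothetical, and the brute-force setting is an assumption made for illustration, not the approach taken in this essay.

```python
def candidate_itemsets(n_items: int) -> int:
    """Number of non-empty subsets of n_items distinct items (2^n - 1)."""
    if n_items < 0:
        raise ValueError("n_items must be non-negative")
    return 2 ** n_items - 1


if __name__ == "__main__":
    # The count doubles (plus one) with every extra item, so an exhaustive
    # scan grows exponentially in the number of distinct items.
    for n in (10, 20, 40):
        print(n, candidate_itemsets(n))
```

Even a modest number of distinct items already rules out exhaustive enumeration, which is one reason to look for approaches that give up some accuracy in exchange for a workable running time.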

**Part 2: “Experimenting with sample sizes and frequent itemsets”**

Abstract - We concluded the first part of this essay with a comparison of the two main methods we found for determining sample sizes for frequent itemset mining. In this part we elaborate on that comparison and implement both methods, from Toivonen and from Riondato & Upfal. To compare the methods we focus on a number of explicitly measurable properties, ranging from the variables used to the results obtained.
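To give a feel for what a sample-size calculation looks like, here is a minimal sketch of the Chernoff-style bound usually attributed to Toivonen's sampling paper: a sample of size n ≥ ln(2/δ) / (2ε²) keeps the error in one itemset's estimated frequency below ε with probability at least 1 - δ. The function name `toivonen_sample_size` is hypothetical, and it is an assumption that the implementation in this essay uses the bound in exactly this form. The bound from Riondato & Upfal is not sketched here.

```python
import math


def toivonen_sample_size(epsilon: float, delta: float) -> int:
    """Smallest integer n with n >= ln(2 / delta) / (2 * epsilon^2).

    epsilon: maximum tolerated absolute error in an itemset's frequency.
    delta:   maximum tolerated probability that the error exceeds epsilon.
    Both names are illustrative and not taken from the essay.
    """
    if not 0 < epsilon < 1:
        raise ValueError("epsilon must lie strictly between 0 and 1")
    if not 0 < delta < 1:
        raise ValueError("delta must lie strictly between 0 and 1")
    return math.ceil(math.log(2 / delta) / (2 * epsilon ** 2))


if __name__ == "__main__":
    # Tighter accuracy (smaller epsilon) costs far more samples than
    # higher confidence (smaller delta): 1/epsilon^2 versus ln(1/delta).
    for eps, dlt in ((0.01, 0.01), (0.01, 0.001), (0.001, 0.01)):
        print(eps, dlt, toivonen_sample_size(eps, dlt))
```

The required sample size in this bound does not depend on the size of the dataset, only on ε and δ, which is what makes the trade-off between accuracy and efficiency explicit.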

GitHub repository (access on request)

CS Course: Big Data