MKT439 (Big Data for Marketing)
This assignment is to be completed individually.
Create a new sheet and briefly answer these questions: Consider an area of interest or business idea you have thought of (e.g. an e-commerce fashion store, a family business, etc.). Which data architecture (RDBMS, or NoSQL with Hadoop) would be best suited for this scenario? Why? Furthermore, which types of analytics do you think would be most useful on the data (and which wouldn’t be)? These types include descriptive, diagnostic, predictive, and prescriptive analytics. Why?
You have been hired by a Bangkok firm’s Business Intelligence department. You have been tasked with streamlining and automating where possible the steps involved in the preparation of data required for various reports that need to be produced on a twice-weekly basis. To achieve this, you need to apply the Excel techniques that relate to the ETL procedure covered in class.
The end result should be that the data prepared conforms to the firm’s reporting template, which is intended for analysis purposes. The data sets you will draw from include:
- leads (potential customers)
- lead referral sources (where the leads come from)
- lead statuses (the status of each lead)
- customers (leads that were converted into customers)
- subscriptions (the service packages customers are subscribed to)
- countries (where customers are from).
It is important to note that the data is housed in two different systems. The lead data (including lead referral sources and lead statuses) is housed in the firm’s sales system. The customer data (including the subscriptions and countries) is housed in another system.
Unfortunately, the two systems aren’t interoperable, and the only link between the lead data and the customer data is the ‘customer_id’, which is recorded in the ‘remarks’ column of the ‘leads’ table of the sales system for those customers with a ‘converted’ status.
You have been provided with Excel files embedded with connections to the firm’s various data sources, and each file simply needs to be refreshed to retrieve fresh data from its source (for this assignment, as these aren’t actually connected to real data sources, let’s just assume they already contain fresh data). These files correspond to the data sets listed above.
In the Assignment I.xlsx file, by using Excel queries alone, you need to extract and combine relevant data from the files above.
For Part A, as a result of the above steps, the Assignment I.xlsx file should have:
- the pre-existing ‘appendix_definitions’ sheet (to be covered in Part C)
- the pre-existing ‘definition_map’ sheet (to be covered in Part C)
- the pre-existing ‘report_template’ sheet (to be covered in Part E)
- 6 Excel queries, each responsible for extracting data from the aforementioned files, and each named accordingly.
- 6 additional sheets resulting from executing these queries, each named accordingly.
- 1 or more additional queries resulting from merging the data from the results of the 6 queries above. The query should be named accordingly (e.g. ‘merged_data’).
- 1 or more sheets resulting from the above query (or queries). This sheet should contain all leads and related lead data, as well as the customer and customer-related data for those leads that are conversions.
There are logical relations between each of the files, and it is possible to combine the data from the 6 different sources into one cohesive set of data. Note that, in order to combine the lead and customer data, the customer_id contained within the ‘remarks’ column of the leads dataset needs some cleaning/transformation steps applied before the data can be merged. (Tip: in an Excel query step you can add a custom column and use the Text.Remove() function to achieve this.)
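The cleaning step amounts to stripping everything that isn’t part of the numeric customer_id. A minimal Python sketch of that logic (the sample ‘remarks’ value is hypothetical; the real format depends on the sales system, and in Excel this would be a custom column using Text.Remove()):

```python
import re

def extract_customer_id(remarks):
    """Strip every non-digit character from a free-text remarks value,
    mimicking a custom query column that removes letters/punctuation."""
    return re.sub(r"\D", "", remarks or "")

print(extract_customer_id("converted: customer_id 10423"))  # -> 10423
print(extract_customer_id(""))                              # -> empty string
```

In Excel’s query editor, the equivalent custom column would pass the alphabetic and punctuation characters to Text.Remove() so that only the digits remain for the merge.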
For this part, you are tasked with implementing steps for automatic data cleansing. It should be noted that the two systems the data comes from are somewhat flawed when it comes to validating user input. Also, the sales system provides only a limited set of purpose-specific fields, so users have to resort to entering data in the ‘descriptor’ and ‘remarks’ fields. As a result, there may be data integrity issues relating to:
- Duplicate records
- The year of birth of customers (some may be entered as Thai calendar years)
- The gender of leads
The steps for cleaning the data can be added as either:
- Extra steps in the queries implemented in Part A
- At least one added query drawing from the results of Part A
- Calculated fields added to the resulting sheet of Part A or to the added queries in Part B. (These calculated fields could, for example, employ text functions to correct data, and should be named accordingly. For example, a field with original data named ‘year_of_birth’ could have a companion field called ‘year_of_birth_cleansed’: a calculated field based on the original, containing the cleansed data.)
- A combination of the above
For Part B, as a result of the above steps, the Assignment I.xlsx file should have:
- the resulting queries and sheets from Part A.
- at least one more added query and resulting sheet(s). Remember that queries often depend on the results of other queries, so be careful about the order in which queries are executed. For that reason, it’s a good idea to include the order in the naming, e.g. “Step 7 – merged_data”. Make sure that the last query in the stack is clearly labelled as such.
- at least 5 calculated fields need to be present, either in the sheet from the query above or in sheets from the results of Part A.
Importantly, since the data would be refreshed twice-weekly, instead of manually correcting the source data, you need to add implementations that would allow this to be done automatically each time.
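As a sketch of the cleansing logic such query steps or calculated fields might implement (the Thai-year threshold and the accepted gender spellings below are assumptions; the 543-year offset between the Thai Buddhist and Gregorian calendars is standard):

```python
def cleanse_year_of_birth(year):
    """Thai (Buddhist) calendar years run 543 years ahead of Gregorian,
    so a 'birth year' far in the future is assumed to be a Thai year."""
    year = int(year)
    return year - 543 if year > 2100 else year

def cleanse_gender(value):
    """Collapse free-text gender entries to 'M'/'F'; anything
    unrecognised is flagged rather than guessed."""
    v = (value or "").strip().lower()
    if v in {"m", "male"}:
        return "M"
    if v in {"f", "female"}:
        return "F"
    return "UNKNOWN"

print(cleanse_year_of_birth(2535))  # Thai year -> 1992
print(cleanse_year_of_birth(1985))  # already Gregorian -> 1985
print(cleanse_gender("Female"))     # -> F
```

In the workbook, the same rules can be expressed as IF/text formulas in a ‘year_of_birth_cleansed’ calculated field, or as conditional-column steps in a query, so they re-apply automatically on every refresh.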
For this part, you are tasked with translating data to meet the firm’s reporting requirements. Data needs to conform to the definitions on the ‘appendix_definitions’ sheet.
Although this can be achieved through various means, for this part of the assignment you are required to map/translate equivalent values in the provided empty ‘definition_map’ sheet. This must be done by copying the existing values from the data sets and the definitions from ‘appendix_definitions’ to different columns on the ‘definition_map’ sheet, and by using the VLOOKUP function in added calculated fields in the data.
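Conceptually, a VLOOKUP with exact matching (range_lookup set to FALSE) behaves like a dictionary lookup against the two columns of the ‘definition_map’ sheet. A small Python sketch of that behaviour (the raw values and definitions below are hypothetical placeholders, not the firm’s actual definitions):

```python
# Two-column definition_map: raw value -> reporting definition.
definition_map = {
    "fb_ads": "Facebook Advertising",   # hypothetical mapping
    "walk_in": "Walk-in",               # hypothetical mapping
}

def translate(raw_value, mapping):
    """Return the mapped definition, or '#N/A' as an exact-match
    VLOOKUP would when the value is missing from the map."""
    return mapping.get(raw_value, "#N/A")

print(translate("fb_ads", definition_map))   # -> Facebook Advertising
print(translate("unknown", definition_map))  # -> #N/A
```

A miss showing up as #N/A in the calculated field is useful in itself: it tells you the ‘definition_map’ sheet is missing a raw value that appears in the data.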
For Part C, as a result of the above steps, the Assignment I.xlsx file should have:
- the results of Part A and B
- a completed ‘definition_map’ sheet
- added calculated fields to one of the existing data sheets with the derived translations
Part D (Optional)
For this optional part, you are tasked with creating a pivot table on a new sheet and using it to test the validity/integrity of the data. You can rename the sheet to ‘integrity_checks’, and use it to check for things such as misspelt values that may cause data to be aggregated incorrectly, missing values, etc.
If any integrity issues are found, you can go back and address them in Part B by adding extra query steps or calculated field checks.
Implementing some form of integrity testing is important, as fresh data would be retrieved twice-weekly, meaning that new permutations of dirty data may exist.
For Part D, as a result of the above steps, the Assignment I.xlsx file should have:
- The results of Part A, B, and C
- An optionally added ‘integrity_checks’ sheet containing pivot tables or other means of checking for data integrity
For this part, you are tasked with creating a pivot table on a new sheet from the data resulting from Parts A to C. The new sheet needs to be renamed to ‘report_data’, and the pivot table it contains needs to be formatted so that its values can be copied and pasted into the provided template on the ‘report_template’ sheet.
Only the blue-shaded columns of the template represent fields that need to be present in the pivot table. In order to derive the values required by the orange-shaded columns of the template, added calculated fields based on the pivot table columns are required.
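To illustrate the blue/orange split (the field names and the conversion-rate metric below are entirely hypothetical; the real columns come from the ‘report_template’ sheet):

```python
from collections import defaultdict

# Hypothetical merged rows: referral source plus a conversion flag.
rows = [
    {"referral_source": "fb_ads", "converted": True},
    {"referral_source": "fb_ads", "converted": False},
    {"referral_source": "walk_in", "converted": True},
]

# 'Blue' columns: aggregates the pivot table itself can produce.
totals = defaultdict(lambda: {"leads": 0, "conversions": 0})
for r in rows:
    t = totals[r["referral_source"]]
    t["leads"] += 1
    t["conversions"] += int(r["converted"])

# 'Orange' columns: calculated fields derived from the pivot columns,
# e.g. a ratio of two aggregated counts.
for t in totals.values():
    t["conversion_rate"] = t["conversions"] / t["leads"]

print(totals["fb_ads"]["conversion_rate"])   # -> 0.5
print(totals["walk_in"]["conversion_rate"])  # -> 1.0
```

In the workbook, the ‘orange’ values would be ordinary formulas in cells alongside the pivot table, referencing the pivot columns, so they recalculate whenever the pivot refreshes.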
For Part E, as a result of the above steps, the Assignment I.xlsx file should have:
- The results of Part A, B, and C
- An added sheet resulting from creating a pivot table, renamed to ‘report_data’. The columns of the pivot table and the added calculated columns should correspond directly with those of the table on the report_template sheet, i.e. the number of columns and their order should match. Although the columns need to correspond to each other, the naming of the pivot table columns does not have to match exactly. Excess columns in the pivot table that don’t correspond to the template’s columns can be hidden.
- A completed ‘report_template’ sheet as a result of copying and pasting the values from the ‘report_data’ sheet into the template’s table provided.