1) A data scientist has a large dataset with some missing values in a numerical column. Deleting every row with a missing value would cause significant data loss. What is a common and effective alternative strategy for handling these missing values?
A) Impute the missing values with a measure of central tendency, such as the column's mean or median (the median is less sensitive to outliers).
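As an illustration of the answer above, here is a minimal sketch of median imputation in plain Python. The function name `impute_median` and the sample `ages` column are hypothetical, and missing values are assumed to be represented as `None`.

```python
from statistics import median


def impute_median(values):
    """Return a copy of `values` with None entries replaced by the median
    of the observed (non-missing) entries."""
    observed = [v for v in values if v is not None]
    if not observed:
        raise ValueError("cannot impute: the column has no observed values")
    fill = median(observed)
    return [fill if v is None else v for v in values]


# Hypothetical column with two missing entries and one outlier (90).
ages = [25, None, 31, 40, None, 90]
imputed = impute_median(ages)
print(imputed)
```

Using the median here keeps the fill value from being pulled toward the outlier, which is one reason it is often preferred over the mean for skewed columns.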
2) Which type of business problem is often addressed using unsupervised learning methods?
A) Identifying natural groupings or segments within customer data.
3) What is “data imputation” used for?
A) Filling in missing values in a dataset.
4) The data science principle of “Garbage In, Garbage Out” (GIGO) implies which of the following?
A) The performance and reliability of an AI model are fundamentally limited by the quality of its training data.
5) What is the main objective of developing a data strategy for a business?
A) To outline how data will be acquired, managed, and utilized to achieve business goals.
6) What is the primary purpose of Exploratory Data Analysis (EDA)?
A) To understand the main characteristics of data, often with visual methods.
7) What is the primary purpose of using a technique like One-Hot Encoding during data preprocessing?
A) To convert categorical text data into binary (0/1) indicator columns, one per category, that a model can process numerically.
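The encoding described in this answer can be sketched in a few lines of plain Python. The helper `one_hot` and the `colors` data are hypothetical, and the column order (alphabetical) is simply an assumption made for reproducibility.

```python
def one_hot(values):
    """Map each categorical value to a list of 0/1 indicators,
    one position per distinct category (sorted alphabetically)."""
    categories = sorted(set(values))
    rows = [[1 if v == c else 0 for c in categories] for v in values]
    return categories, rows


# Hypothetical categorical column.
colors = ["red", "green", "blue", "green"]
categories, encoded = one_hot(colors)

print(categories)
for value, row in zip(colors, encoded):
    print(value, row)
```

Each row contains exactly one 1, so no artificial ordering is implied between categories, unlike mapping them to 0, 1, 2.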
8) Which of the following is NOT a common goal of data cleaning?
A) Introducing bias into the dataset.
9) What is the primary goal of clustering in unsupervised learning?
A) To group similar data points together without prior labels.
10) Which visualization tool is commonly used during EDA to show the distribution of a single numerical variable?
A) Histogram.
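To show what a histogram summarizes, the sketch below counts values per fixed-width bin and prints a text bar for each. The function `histogram_counts`, the bin width of 10, and the `ages` sample are all assumptions chosen for illustration; in practice a plotting library would usually draw this.

```python
def histogram_counts(values, bin_width, start):
    """Count how many values fall into each bin of width `bin_width`,
    keyed by the bin's left edge."""
    counts = {}
    for v in values:
        left = start + ((v - start) // bin_width) * bin_width
        counts[left] = counts.get(left, 0) + 1
    return dict(sorted(counts.items()))


# Hypothetical sample of a single numerical variable.
ages = [18, 22, 25, 31, 34, 38, 39, 47, 52, 64]
counts = histogram_counts(ages, bin_width=10, start=10)

for left, n in counts.items():
    print(f"{left}-{left + 9}: {'#' * n}")
```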
11) What does a “robust” machine learning algorithm imply in the context of data preparation?
A) It performs well even with noisy, incomplete, or inconsistent data.
12) A dataset contains features for ‘Age’ (ranging from 18-65) and ‘Annual Income’ (ranging from $30,000-$150,000). Why is it crucial to apply a technique like Normalization before training many types of machine learning models?
A) To prevent the ‘Annual Income’ feature from disproportionately influencing the model due to its larger scale.
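One common form of normalization is min-max scaling, which maps each feature onto the range 0 to 1. The sketch below assumes that variant; the function `min_max_scale` and the sample values are hypothetical, picked from the ranges given in the question.

```python
def min_max_scale(values):
    """Rescale values linearly so the minimum maps to 0.0 and the maximum to 1.0."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant feature carries no spread; map every entry to 0.0.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


# Hypothetical samples spanning the ranges in the question.
ages = [18, 41.5, 65]
incomes = [30_000, 90_000, 150_000]

scaled_ages = min_max_scale(ages)
scaled_incomes = min_max_scale(incomes)
print(scaled_ages)
print(scaled_incomes)
```

After scaling, a difference of 1.0 means "the full observed range" for both features, so distance-based or gradient-based models no longer see income as thousands of times larger than age.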
13) What is a common challenge when performing Exploratory Data Analysis (EDA)?
A) It can be time-consuming and requires domain knowledge.
14) What is a “centroid” in the context of K-means clustering?
A) The central point of a cluster, representing its average position.
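A centroid as described here is just the per-dimension mean of the points assigned to a cluster. A minimal sketch, with a hypothetical `centroid` helper and made-up 2-D points:

```python
def centroid(points):
    """Return the mean position of a non-empty list of equal-length points."""
    n = len(points)
    dims = len(points[0])
    return tuple(sum(p[d] for p in points) / n for d in range(dims))


# Hypothetical points already assigned to one cluster.
cluster = [(1.0, 2.0), (3.0, 4.0), (5.0, 9.0)]
center = centroid(cluster)
print(center)
```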
15) Why is data preparation essential before building machine learning algorithms?
A) It ensures the data is in a suitable format and quality for the algorithm to learn effectively.
16) In the K-means algorithm, what does ‘K’ represent?
A) The number of clusters to form.
17) K-means is a popular algorithm for which type of unsupervised learning?
A) Clustering.
18) What role does “data strategy” play in a business’s overall AI adoption?
A) It ensures data is collected, managed, and used effectively to support AI initiatives and business objectives.
19) Which of the following is a key characteristic of unsupervised learning?
A) It aims to discover hidden patterns or structures in data.
20) When building clusters using K-means, what is a common initial step?
A) Choosing the number of clusters (K), for example by comparing results across several candidate values.
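One widely used heuristic for picking K is the "elbow" approach: run K-means for several values of K and look for the point where the within-cluster sum of squares (inertia) stops dropping sharply. The sketch below is a simplified 1-D version with hand-picked starting centroids so the result is reproducible; the function `kmeans_1d`, the data, and the starting points are all assumptions, not a production implementation.

```python
def kmeans_1d(values, centroids, iterations=10):
    """Run Lloyd's algorithm on 1-D data from the given starting centroids.
    Returns the final centroids and the within-cluster sum of squares."""
    centroids = list(centroids)
    for _ in range(iterations):
        # Assignment step: attach each value to its nearest centroid.
        clusters = [[] for _ in centroids]
        for v in values:
            nearest = min(range(len(centroids)), key=lambda i: abs(v - centroids[i]))
            clusters[nearest].append(v)
        # Update step: move each centroid to the mean of its cluster
        # (an empty cluster keeps its previous centroid).
        centroids = [
            sum(c) / len(c) if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
    inertia = sum(min((v - c) ** 2 for c in centroids) for v in values)
    return centroids, inertia


# Hypothetical data with two obvious groups.
data = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
starts = {1: [1.0], 2: [1.0, 12.0], 3: [1.0, 10.0, 12.0]}

inertias = {}
for k, start in starts.items():
    centers, inertia = kmeans_1d(data, start)
    inertias[k] = inertia
    print(k, centers, inertia)
```

In this toy data the inertia falls steeply from K=1 to K=2 and only slightly from K=2 to K=3, which is the kind of "elbow" that suggests K=2.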
