AIML500-Week 4 Quiz

1. Which of the following describes a common cause of data duplication?

A: Errors in data collection or integration from multiple sources 

2. When data from different sources have inconsistent formats, which data quality issue is being encountered?

A: Data integration 

3. Which of the following describes the challenge of data drift?

A: A shift in the statistical properties of the input data over time 
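
As a rough illustration of such a shift, the sketch below compares the mean of a recent window of a feature against a reference window from training time. The function name `mean_shift_score`, the sample values, and the `DRIFT_THRESHOLD` cut-off are all hypothetical; real monitoring tools typically use richer statistical tests.

```python
from statistics import mean, stdev


def mean_shift_score(reference, current):
    """Distance between the two window means, in reference standard deviations."""
    ref_sd = stdev(reference)
    if ref_sd == 0:
        raise ValueError("reference window has zero variance")
    return abs(mean(current) - mean(reference)) / ref_sd


# Hypothetical feature values: one window from training time, one from production.
reference = [10.0, 11.0, 9.0, 10.5, 9.5]
current = [13.0, 14.0, 12.5, 13.5, 14.5]

DRIFT_THRESHOLD = 2.0  # assumed cut-off, chosen only for this example
score = mean_shift_score(reference, current)
drifted = score > DRIFT_THRESHOLD
print(f"shift score = {score:.2f}, drifted = {drifted}")
```

A mean comparison only catches shifts in location; changes in spread or shape would need a different check.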

4. Why is it important to ensure data consistency in a dataset?

A: Inconsistent data can lead to inaccurate and unreliable model predictions 

5. How can data quality monitoring tools help in ensuring the reliability of machine learning models?

A: They detect and report anomalies, inconsistencies, and data drift 

6. What is the primary risk associated with data redundancy?

A: Increased training time with no added value 

7. What is the main issue caused by outliers in a dataset?

A: They can skew the results and mislead the model’s training process 
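
A small example of the skew, using made-up numbers: a single extreme value moves the mean a long way while the median barely changes.

```python
from statistics import mean, median

# Hypothetical salaries in thousands; the last entry added below is an outlier.
salaries = [40, 42, 45, 47, 50]
with_outlier = salaries + [1000]

print(mean(salaries), median(salaries))          # 44.8 45
print(mean(with_outlier), median(with_outlier))  # 204 46.0
```

Any model that fits to averages or squared errors is pulled toward the outlier in the same way the mean is.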

8. What is the most significant impact of poor data quality in machine learning?

A: Inaccurate predictions and unreliable models 

9. Data leakage occurs when:

A: Test data is used in training the model 
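
Leakage is often subtler than training directly on test rows. One common form, sketched below with made-up values, is computing preprocessing statistics on the full dataset before splitting. The variable names are illustrative only.

```python
from statistics import mean, stdev

# Hypothetical feature column; the last two rows are held out as the test split.
values = [1.0, 2.0, 3.0, 4.0, 100.0, 120.0]
train, test = values[:4], values[4:]

# Leaky: the statistic is computed on all rows, so test information
# influences how the training data is transformed.
leaky_mean = mean(values)

# Safer: statistics come from the training split only and are reused on test.
train_mean, train_sd = mean(train), stdev(train)
test_scaled = [(v - train_mean) / train_sd for v in test]

print(leaky_mean, train_mean)
```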

10. Which of the following best defines the term “data lineage” in the context of data quality?

A: The history and origin of data, including any transformations it has undergone 

11. Which of the following is NOT a method for handling noisy data?

A: Data augmentation

12. How does inconsistent data affect model performance?

A: It makes model results unpredictable and reduces reliability

13. Which of the following is NOT a common data quality issue in machine learning?

A: High-quality annotations 

14. Which of the following is a method to handle missing data?

A: Impute missing values using statistical methods 
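
A minimal sketch of statistical imputation, assuming missing entries are represented as `None` and that the median is a reasonable fill value for the column. The helper `impute_median` is hypothetical, not a library function.

```python
from statistics import median


def impute_median(values):
    """Replace None entries with the median of the observed entries."""
    observed = [v for v in values if v is not None]
    if not observed:
        raise ValueError("cannot impute a column with no observed values")
    fill = median(observed)
    return [fill if v is None else v for v in values]


ages = [25, None, 31, 40, None, 28]
imputed = impute_median(ages)
print(imputed)  # [25, 29.5, 31, 40, 29.5, 28]
```

The median is used here because it is less sensitive to outliers than the mean; either is a valid statistical choice depending on the data.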

15. Which of the following is an appropriate way to handle outliers in the data?

A: Use algorithms that are less sensitive to outliers, such as decision trees 

16. What type of data issue arises when some values in the dataset are incomplete or absent?

A: Missing data 

17. What is a challenge in using real-time data for training machine learning models?

A: The data may contain missing values or be noisy 

18. Which data preprocessing technique is commonly used to address data imbalance?

A: Synthetic data generation using techniques like SMOTE 
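
The core idea behind SMOTE is to create new minority-class points by interpolating between existing ones. The sketch below is a deliberately simplified version: it interpolates between random pairs of minority samples, whereas SMOTE proper interpolates toward one of a sample's k nearest neighbours. The function `interpolate_minority` and the data are hypothetical.

```python
import random


def interpolate_minority(samples, n_new, seed=0):
    """Create n_new synthetic points on line segments between random sample pairs."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        a, b = rng.sample(samples, 2)  # two distinct minority points
        t = rng.random()               # position along the segment from a to b
        synthetic.append(tuple(x + t * (y - x) for x, y in zip(a, b)))
    return synthetic


# Hypothetical minority-class points with two features each.
minority = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.5)]
new_points = interpolate_minority(minority, n_new=4)
print(len(new_points))  # 4
```

In practice a maintained implementation would normally be preferred over a hand-rolled one like this.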

19. What is an effective strategy for detecting data duplicates?

A: Using algorithms that match records based on similarity metrics
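
One way to sketch similarity-based matching is with `difflib.SequenceMatcher` from the Python standard library, which returns a ratio between 0 and 1 for two strings. The helper name, the 0.85 threshold, and the records are assumptions made for this example.

```python
from difflib import SequenceMatcher
from itertools import combinations


def find_near_duplicates(records, threshold=0.85):
    """Return index pairs of records whose similarity ratio meets the threshold."""
    pairs = []
    for (i, a), (j, b) in combinations(enumerate(records), 2):
        ratio = SequenceMatcher(None, a.lower(), b.lower()).ratio()
        if ratio >= threshold:
            pairs.append((i, j))
    return pairs


# Hypothetical customer records; the first two differ only by small typos.
customers = [
    "Jon Smith, 12 Oak St",
    "John Smith, 12 Oak St.",
    "Maria Garcia, 5 Elm Rd",
]
print(find_near_duplicates(customers))  # [(0, 1)]
```

Comparing every pair is quadratic in the number of records, so larger datasets usually need a blocking or indexing step first.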

20. What is data bias in machine learning?

A: When the data does not represent the full population and leads to skewed results

21. Data veracity refers to:

A: The reliability and accuracy of data

22. How does noisy data typically affect machine learning models?

A: It leads to overfitting or underfitting

23. How can imbalanced data affect a machine learning model?

A: It causes the model to perform poorly on underrepresented classes

24. What is a common consequence of using low-quality labels in supervised learning?

A: Misleading predictions and reduced model performance 

25. What is the role of feature scaling in addressing data quality issues?

A: It ensures all features contribute equally to the model’s learning process 
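
As an illustration, min-max scaling maps each feature onto the range 0 to 1, so a feature measured in tens of thousands no longer dominates one measured in tens. The helper `min_max_scale` and the values are hypothetical.

```python
def min_max_scale(values):
    """Linearly rescale values to the range [0, 1]."""
    lo, hi = min(values), max(values)
    if lo == hi:
        # A constant feature carries no information; map it to zeros.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


# Two hypothetical features on very different scales.
income = [30000, 60000, 90000]
age = [20, 35, 50]

print(min_max_scale(income))  # [0.0, 0.5, 1.0]
print(min_max_scale(age))     # [0.0, 0.5, 1.0]
```

Scaling matters most for distance-based and gradient-based methods; tree-based models are largely unaffected by it.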
