1. Which of the following describes a common cause of data duplication?
A: Errors in data collection or integration from multiple sources
2. When data from different sources have inconsistent formats, which data quality issue is being encountered?
A: A data integration issue
3. Which of the following describes the challenge of data drift?
A: A shift in the statistical properties of the input data over time
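A minimal sketch of one way to spot such a shift: compare the mean of a recent window against a reference window, measured in reference standard deviations. The function name, the sample values, and the threshold of 2.0 are illustrative assumptions, not a standard.

```python
from statistics import mean, stdev

def mean_shift_score(reference, current):
    """Return how many reference standard deviations the current mean has moved."""
    ref_sd = stdev(reference)
    if ref_sd == 0:
        raise ValueError("reference window has zero variance")
    return abs(mean(current) - mean(reference)) / ref_sd

# Hypothetical feature values: a training-time window and two later windows.
reference = [10.0, 11.0, 9.0, 10.5, 9.5, 10.0]
stable = [10.2, 9.8, 10.1, 9.9, 10.4, 9.6]
shifted = [13.0, 14.0, 12.5, 13.5, 12.0, 13.0]

DRIFT_THRESHOLD = 2.0  # assumed cut-off; tune per feature

for name, window in [("stable", stable), ("shifted", shifted)]:
    score = mean_shift_score(reference, window)
    status = "drift" if score > DRIFT_THRESHOLD else "ok"
    print(name, round(score, 2), status)
```

A mean comparison only catches shifts in location; changes in spread or shape would need a different statistic.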
4. Why is it important to ensure data consistency in a dataset?
A: Inconsistent data can lead to inaccurate and unreliable model predictions
5. How can data quality monitoring tools help in ensuring the reliability of machine learning models?
A: They detect and report anomalies, inconsistencies, and data drift
6. What is the primary risk associated with data redundancy?
A: Increased training time with no added value
7. What is the main issue caused by outliers in a dataset?
A: They can skew the results and mislead the model’s training process
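To illustrate the skew, the sketch below compares the mean and median of a small sample containing one extreme value, then flags that value with the common 1.5 * IQR rule. The data and the helper name iqr_outliers are made up for illustration.

```python
from statistics import mean, median, quantiles

def iqr_outliers(values, k=1.5):
    """Return values lying more than k * IQR outside the first or third quartile."""
    q1, _, q3 = quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]

# Hypothetical measurements with one extreme value.
values = [12, 13, 12, 14, 13, 12, 15, 13, 14, 250]

print("mean:", mean(values))      # pulled upward by the extreme value
print("median:", median(values))  # barely affected by it
print("flagged:", iqr_outliers(values))
```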
8. What is the most significant impact of poor data quality in machine learning?
A: Inaccurate predictions and unreliable models
9. Data leakage occurs when:
A: Test data, or information derived from it, is used in training the model
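A rough sketch of how leakage is usually avoided: split first, check that no row appears in both sets, and compute any preprocessing statistics from the training rows only. The split helper and the data are hypothetical; treating preprocessing statistics as a leakage path is a common extension of the definition above, not something the quiz states.

```python
import random
from statistics import mean, stdev

def train_test_split(rows, test_fraction=0.25, seed=0):
    """Shuffle a copy of rows and return (train, test) with no shared rows."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, int(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]

# Hypothetical single-feature dataset with unique values.
rows = [float(x) for x in range(1, 21)]
train, test = train_test_split(rows)

# No test row may also appear in the training set.
overlap = set(train) & set(test)
print("overlapping rows:", len(overlap))

# Fit scaling statistics on the training rows only, then reuse them for test.
mu, sigma = mean(train), stdev(train)
train_scaled = [(x - mu) / sigma for x in train]
test_scaled = [(x - mu) / sigma for x in test]
```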
10. Which of the following best defines the term “data lineage” in the context of data quality?
A: The history and origin of data, including any transformations it has undergone
11. Which of the following is NOT a method for handling noisy data?
A: Data augmentation
12. How does inconsistent data affect model performance?
A: It makes model results unpredictable and reduces reliability
13. Which of the following is NOT a common data quality issue in machine learning?
A: High-quality annotations
14. Which of the following is a method to handle missing data?
A: Impute missing values using statistical methods
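As a small illustration, the sketch below fills missing entries (represented here as None, an assumption about the data encoding) with the median of the observed values. Median imputation is just one statistical method; mean or mode imputation follows the same pattern.

```python
from statistics import median

def impute_median(values):
    """Replace None entries with the median of the observed values."""
    observed = [v for v in values if v is not None]
    if not observed:
        raise ValueError("cannot impute a column with no observed values")
    fill = median(observed)
    return [fill if v is None else v for v in values]

# Hypothetical column with two missing entries.
ages = [34, None, 29, 41, None, 38]
print(impute_median(ages))
```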
15. Which of the following is an appropriate way to handle outliers in the data?
A: Use algorithms that are less sensitive to outliers, such as decision trees
16. What type of data issue arises when some values in the dataset are incomplete or absent?
A: Missing data
17. What is a challenge in using real-time data for training machine learning models?
A: The data may contain missing values or be noisy
18. Which data preprocessing technique is commonly used to address data imbalance?
A: Synthetic data generation using techniques like SMOTE
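The core idea behind SMOTE is to create new minority-class points by interpolating between an existing minority sample and one of its nearest minority neighbours. The sketch below is a simplified stdlib-only illustration of that idea, not the implementation from any particular library; the function name smote_like and the sample points are hypothetical.

```python
import math
import random

def smote_like(minority, n_new, k=2, seed=0):
    """Create n_new synthetic points by interpolating between minority neighbours."""
    if len(minority) < 2:
        raise ValueError("need at least two minority samples")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        base = minority[i]
        # The k nearest minority neighbours of the base point, excluding itself.
        others = [p for j, p in enumerate(minority) if j != i]
        neighbours = sorted(others, key=lambda p: math.dist(base, p))[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment from base to neighbour
        synthetic.append(tuple(b + gap * (n - b) for b, n in zip(base, neighbour)))
    return synthetic

# Hypothetical minority-class samples with two features each.
minority = [(1.0, 1.0), (1.2, 0.9), (0.9, 1.3), (1.1, 1.1)]
new_points = smote_like(minority, n_new=6)
print(len(new_points), "synthetic samples")
```

Because every synthetic point lies on a segment between two real minority samples, it stays inside the region the minority class already occupies.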
19. What is an effective strategy for detecting data duplicates?
A: Using algorithms that match records based on similarity metrics
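One possible sketch of similarity-based matching uses the standard library's difflib.SequenceMatcher to score every pair of records after light normalization. The records, the normalize helper, and the 0.85 threshold are illustrative assumptions; comparing all pairs is only practical for small datasets.

```python
from difflib import SequenceMatcher
from itertools import combinations

def normalize(record):
    """Lowercase the record and collapse runs of whitespace."""
    return " ".join(record.lower().split())

def likely_duplicates(records, threshold=0.85):
    """Return (i, j, score) for record pairs whose similarity meets the threshold."""
    pairs = []
    for (i, a), (j, b) in combinations(enumerate(records), 2):
        score = SequenceMatcher(None, normalize(a), normalize(b)).ratio()
        if score >= threshold:
            pairs.append((i, j, round(score, 2)))
    return pairs

# Hypothetical customer records; the first two describe the same person.
records = [
    "Jane Doe, 12 Main Street",
    "jane doe,  12 Main St.",
    "John Smith, 4 Oak Avenue",
]
print(likely_duplicates(records))
```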
20. What is data bias in machine learning?
A: When the data does not represent the full population and leads to skewed results
21. Data veracity refers to:
A: The reliability and accuracy of data
22. How does noisy data typically affect machine learning models?
A: It leads to overfitting or underfitting
23. How can imbalanced data affect a machine learning model?
A: It causes the model to perform poorly on underrepresented classes
24. What is a common consequence of using low-quality labels in supervised learning?
A: Misleading predictions and reduced model performance
25. What is the role of feature scaling in addressing data quality issues?
A: It puts all features on a comparable scale so that no feature dominates the model’s learning process because of its magnitude
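A brief sketch of standardization (z-scoring), one common form of feature scaling: two features measured in very different units end up on the same scale. The column names and values are made up for illustration.

```python
from statistics import mean, pstdev

def standardize(column):
    """Rescale a column to zero mean and unit (population) standard deviation."""
    mu, sigma = mean(column), pstdev(column)
    if sigma == 0:
        return [0.0 for _ in column]  # a constant column carries no spread
    return [(x - mu) / sigma for x in column]

# Hypothetical features on very different scales.
income = [30000, 45000, 60000, 75000]
age = [25, 35, 45, 55]

income_z = standardize(income)
age_z = standardize(age)
print([round(z, 3) for z in income_z])
print([round(z, 3) for z in age_z])
```

Both columns are evenly spaced, so after scaling they map to the same values even though the raw incomes are about a thousand times larger than the ages.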
