IS5213 Data Science and Big Data Solutions

WEEK 2

code

install.packages("dplyr")

library(dplyr)

Rajeshdf = read.csv("c:\\Insurance.csv")

str(Rajeshdf)

summary(Rajeshdf)

# Count the number of records for each JOB category
agg_tbl <- Rajeshdf %>% group_by(JOB) %>%
  summarise(total_count = n(), .groups = "drop")

agg_tbl

a = aggregate( x=Rajeshdf$HOME_VAL, by=list( Rajeshdf$CAR_TYPE), FUN=median, na.rm=TRUE )
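The script above depends on a local copy of Insurance.csv. As a minimal sketch of the same three steps (counts per group, median per group, NA counts per column), the following uses a small made-up data frame in base R. The `toy` data and its values are hypothetical and are not taken from the Insurance data set.

```r
# Hypothetical stand-in for a few Insurance columns (values are invented)
toy <- data.frame(
  JOB = c("Lawyer", "Doctor", "Lawyer", "Clerical", "Doctor", "Lawyer"),
  CAR_TYPE = c("Van", "SUV", "Van", "Pickup", "SUV", "Pickup"),
  HOME_VAL = c(200000, NA, 210000, 150000, 140000, 152000),
  stringsAsFactors = FALSE
)

# Number of records per JOB
job_counts <- table(toy$JOB)
print(job_counts)

# Median HOME_VAL per CAR_TYPE; the formula interface drops NA rows by default
median_home <- aggregate(HOME_VAL ~ CAR_TYPE, data = toy, FUN = median)
print(median_home)

# Number of missing (NA) values in every column
na_counts <- colSums(is.na(toy))
print(na_counts)
```

Swapping `toy` for the data frame read from the CSV should give the same kind of tables for the real data.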

QUIZ

1. What famous literary detective solved a crime because a dog did not bark at the criminal?

Sherlock Holmes

2. In the Insurance data set, how many Lawyers are there?

1031

3. What two prefixes does the instructor use for variables when fixing the missing values? Select all that apply.

IMP_

M_

4. What is the median Home Value of a person who drives a Van?

204139

5. In the insurance data set, how many missing (NA) values does the variable AGE have?

7

6. What is the process called where missing data is fixed?

Imputing

7. According to the instructor, approximately what percentage of the analytic time is spent on data preparation?

90%

8. In the Insurance data set, how many Blue Collar workers are there?

2288

9. What is the median Home Value of a person who drives a Panel Truck?

220541

10. In the insurance data set, how many missing (NA) values does the variable KIDSDRIV have?

0

11. In the Insurance data set, how many Doctors are there?

321

12. What is the median Home Value of a person who drives a Pickup?

151061

13. In the insurance data set, how many missing (NA) values does the variable AGE have?

7

14. What is the process called that converts categorical variables into flag variables?

One Hot Encoding

15. In the insurance data set, how many missing (NA) values does the variable KIDSDRIV have?

0

16. In the R programming language, what is one method for converting a TRUE/FALSE variable into a 1/0 variable?

Add the number zero (0) to the TRUE/FALSE variable.

17. What is the median Home Value of a person who drives an SUV?

140927

18. According to the instructor, after a variable with missing values is “fixed”, it is a good idea to remove the variable from the data set.

True

19. What is the median Home Value of a person who drives a Minivan?

172269

20. In the insurance data set, how many missing (NA) values does the variable YOJ have?

548

21. In the Insurance data set, how many Home Makers are there?

843

22. In the Insurance data set, how many Clerical workers are there?

1590

23. In the insurance data set, how many missing (NA) values does the variable CAR_AGE have?

639
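Several answers above describe data preparation steps: imputing missing values, the IMP_ and M_ prefixes, adding zero to turn TRUE/FALSE into 1/0, one hot encoding, and dropping the original variable afterwards. A minimal sketch of how these could fit together is shown below. The data frame is made up, the reading of IMP_ as "imputed value" and M_ as "missing flag" is an assumption based on the prefix names, median imputation is one possible choice, and the `JOB_` column names are hypothetical.

```r
# Hypothetical toy data standing in for two Insurance columns
df <- data.frame(
  AGE = c(45, NA, 38, 52, NA),
  JOB = c("Lawyer", "Doctor", "Clerical", "Lawyer", "Doctor"),
  stringsAsFactors = FALSE
)

# Flag the rows where AGE was missing; adding 0 converts TRUE/FALSE to 1/0
df$M_AGE <- is.na(df$AGE) + 0

# Impute missing AGE values with the median of the observed values
df$IMP_AGE <- ifelse(is.na(df$AGE), median(df$AGE, na.rm = TRUE), df$AGE)

# One hot encoding: one 1/0 flag column per JOB category
for (job in unique(df$JOB)) {
  df[[paste0("JOB_", job)]] <- (df$JOB == job) + 0
}

# Remove the original variable once IMP_AGE and M_AGE carry its information
df$AGE <- NULL

print(df)
```

Keeping the M_ flag alongside the imputed column preserves the fact that a value was originally missing, which can itself be predictive.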

WEEK 5 QUIZ

1. Random Forests and the Gradient Boosting models will usually be more accurate than Decision Tree models.

True

2. Which of these modelling techniques is not adversely affected by outliers?

All of these

3. Gradient Boosting models are easy to interpret.

False

4. Which of these modelling techniques trains many trees, with each tree built on a random subset of variables?

Random Forests

5. Which of these modelling techniques tends to use many small trees?

Gradient Boosting

6. Which of these modelling techniques is usually the easiest to interpret?

Decision Trees

7. Random Forests are easy to interpret.

False

8. In the United States, it is probably against the law to use a Gradient Boosting model for Marketing models.

False

9. Gradient Boosting models are based on Decision Trees.

True

10. Which of these modelling techniques is usually the fastest to train?

Decision Trees

11. Random Forests and the Gradient Boosting models will always be more accurate than Decision Tree models.

False

12. A Random Forest is more sensitive to a small input change than a Decision Tree.

False

13. Which of these modelling techniques trains many trees, with each tree built on a random subset of records?

Random Forests

14. Random Forests are based on Decision Trees.

True

15. A Gradient Boosting model is less sensitive to a small input change than a Decision Tree.

True

16. In the United States, it may be against the law to use a Gradient Boosting model for Credit or Auto Insurance models.

True

17. Which of these modelling techniques alters the data in order to over sample records that it incorrectly classified?

Gradient Boosting

18. Which of these modelling techniques is usually the easiest to convert into IF-THEN-ELSE rules?

Decision Trees

19. In the United States, it may be against the law to use a Random Forest for Credit or Auto Insurance models.

True

20. In the United States, it is probably against the law to use a Random Forest for Marketing models.

False
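The quiz above contrasts a single Decision Tree (easy to read as IF-THEN-ELSE rules) with ensembles that train many trees on random subsets of records and variables. The sketch below illustrates that idea on the built-in iris data using the `rpart` package, which is assumed to be available (it normally ships with R). The hand-rolled voting ensemble is a simplified illustration only and is not the algorithm implemented by any Random Forest package; `n_trees` and the choice of two variables per tree are arbitrary.

```r
library(rpart)  # assumed available; a recommended package in standard R installs

set.seed(42)
n_trees <- 25
predictors <- setdiff(names(iris), "Species")

# Train each tree on a bootstrap sample of records and a random pair of variables
forest <- lapply(seq_len(n_trees), function(i) {
  rows <- sample(nrow(iris), replace = TRUE)
  vars <- sample(predictors, 2)
  rpart(reformulate(vars, response = "Species"),
        data = iris[rows, ], method = "class")
})

# Every tree votes on every record; the most common class wins
votes <- sapply(forest, function(tree) {
  as.character(predict(tree, iris, type = "class"))
})
ensemble_pred <- apply(votes, 1, function(v) names(which.max(table(v))))
ensemble_accuracy <- mean(ensemble_pred == iris$Species)
print(ensemble_accuracy)

# A single tree prints as nested splits that read like IF-THEN-ELSE rules
single_tree <- rpart(Species ~ ., data = iris, method = "class")
print(single_tree)
```

The printed single tree can be followed by hand, whereas the ensemble's prediction is a vote across 25 different trees, which is why the ensemble is harder to interpret.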

1. When doing tSNE analysis, setting the Perplexity to a low number will tend to favor local aspects of the data. High numbers will tend to favor global data.

True

2.   Principal Components are always Orthogonal to one another.

True

3. When doing tSNE analysis, setting the Perplexity to a high number will tend to have less well-defined groupings.

False

4.   When doing tSNE analysis, setting the Perplexity to a high number will tend to have more well-defined groupings.

False

5. In PCA analysis, the vectors represent a LINEAR relationship in the data.

True

6.   Assume that you have 3 continuous variables in your dataset, how many Principal Components will be created if you do a PCA Analysis?

3

7.   Principal Components are always Independent to one another.

True

8. In the R programming language, the “prcomp” function allows for scoring data using the “predict” command.

True

9.   Assume that you have 8 continuous variables in your data set, how many Principal Components will be created if you do a PCA Analysis?

8

10.    When doing tSNE analysis, setting the Perplexity to a low number will tend to favor global aspects of the data. High numbers will tend to favor local data.

False

11. Assume that you have 3 continuous variables in your dataset, how many Principal Components will be created if you do a PCA Analysis?

3

12)  When doing tSNE analysis, setting the Perplexity to a low number will tend to favor global aspects of the data. High numbers will tend to favor local data.

False

13)  When doing tSNE analysis, setting the Perplexity to a high number will not have more well-defined groupings.

False

14)  Assume that you have 2 continuous variables in your dataset, how many Principal Components will be created if you do a PCA Analysis?

2

15)  tSNE vectors are always Orthogonal to one another.

False

16)  tSNE vectors are always Orthogonal to one another.

False

17)  Assume that you have 8 continuous variables in your dataset, how many Principal Components will be created if you do a PCA Analysis?

8

18)  Assume that you have 8 continuous variables in your dataset, how many Principal Components will be created if you do a tSNE Analysis using Rtsne?

2 or 3

19)  In PCA analysis, the vectors represent a NON LINEAR relationship in the data.

False

20)  Assume that an input data set has four variables: A, B, C, D and they are used to create four Principal Components: PC1, PC2, PC3, and PC4. If A, B, C, D are all highly correlated, then what do you know about the correlation of PC1, PC2, PC3, and PC4?

PC1, PC2, PC3, and PC4 are completely uncorrelated from one another.

21)  In tSNE analysis, the vectors represent a LINEAR relationship in the data.

False

22)  Given the following Scree Plot, how many Principal Components should be used?

2 or possibly 3 Principal Components

23)  In the R programming language, the “Rtsne” function allows for scoring data using the “predict” command.

False

24. To answer this question, please refer to the CRAN Packages web page referred to in the course material.
Which of these packages are used for Optical Character Recognition?

abbyyR

25. Using the iris data set in R, generate a box plot by Species of the variable Petal Length.

26. Using the iris data set in R, generate a box plot by Species of the variable Petal Width.

27. Using the iris data set in R, generate a box plot by Species of the variable Sepal Length.
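The three box plot exercises above have no answers recorded. One way to produce them in base R is sketched below, using the formula interface of `boxplot()` on the built-in iris data; the titles and axis labels are illustrative choices.

```r
# Box plot of Petal Length for each Species
petal_length_bp <- boxplot(Petal.Length ~ Species, data = iris,
                           main = "Petal Length by Species",
                           xlab = "Species", ylab = "Petal Length (cm)")

# Box plot of Petal Width for each Species
petal_width_bp <- boxplot(Petal.Width ~ Species, data = iris,
                          main = "Petal Width by Species",
                          xlab = "Species", ylab = "Petal Width (cm)")

# Box plot of Sepal Length for each Species
sepal_length_bp <- boxplot(Sepal.Length ~ Species, data = iris,
                           main = "Sepal Length by Species",
                           xlab = "Species", ylab = "Sepal Length (cm)")
```

Each call also returns the plotted statistics invisibly, so the objects assigned here can be inspected (for example `petal_length_bp$stats`) if the numbers behind the boxes are needed.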

1. What are the two commands that will return the first and last six rows of a Data Frame?

head, tail

2. The R programming language has data sets that are pre-loaded. One of these data sets is the “iris” data set. What command will give you information about this data set?

?iris

3.  In the R data set, ChickWeight, calculate the median weight value by Diet. What is the median weight of a chick that has Diet=1 ?

88.0

4. In the R data set, ChickWeight, calculate the median weight value by Diet. What is the median weight of a chick that has Diet=2 ?

104.5

5. In the R data set, ChickWeight, calculate the median weight value by Diet. What is the median weight of a chick that has Diet=3 ?

125.5

6. In the R data set, ChickWeight, calculate the median weight value by Diet. What is the median weight of a chick that has Diet=4 ?

129.5

7. To answer this question, please refer to the CRAN Packages web page referred to in the course material.
Which of these packages are used for Reliability and Scoring Routines?

AATtools

8. How many records are in the predefined data set named “trees”

31

9. There is no guarantee that an R Package included in CRAN will be maintained and “up to date”.

True

10. Which of these packages are used for Combining Multidimensional Arrays?

abind

11. How many records are in the predefined data set named “cars”

50

12. Which of these packages are used for Bayesian approximation?

abc

13. If an R Package is included in CRAN it is guaranteed to be regularly updated, and will always be “up to date”. 

False
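The answers about built-in data sets in this quiz can be checked directly in an R session. The following sketch uses only base R and the data sets that ship with it; `median_by_diet` is just an illustrative variable name.

```r
# First and last six rows of a data frame
head(iris)
tail(iris)

# Structure of the data set; in an interactive session, ?iris opens its help page
str(iris)

# Median chick weight for each Diet in the ChickWeight data set
median_by_diet <- aggregate(weight ~ Diet, data = ChickWeight, FUN = median)
print(median_by_diet)

# Number of records in the predefined data sets "trees" and "cars"
nrow(trees)
nrow(cars)
```

The printed medians and row counts can be compared with the answers listed above.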

WEEK 7 QUIZ

1. When doing tSNE analysis, setting the Perplexity to a low number will tend to favor local aspects of the data. High numbers will tend to favor global data.

True

2. In the R programming language, the “Rtsne” function allows for scoring data using the “predict” command.

False

3. Assume that an input data set has four variables: A,B,C,D and they are used to create four Principal Components: PC1, PC2, PC3, and PC4. If A,B,C,D are all highly correlated, then what do you know about the correlation of PC1, PC2, PC3, and PC4?

PC1, PC2, PC3, and PC4 are completely uncorrelated from one another.

4. tSNE vectors are always Independent to one another.

False

5. Assume that you have 3 continuous variables in your data set, how many Principal Components will be created if you do a PCA Analysis?

3

6. Principal Components are always Orthogonal to one another.

True

7. tSNE vectors are always Orthogonal to one another.

False

8. When doing tSNE analysis, setting the Perplexity to a low number will tend to favor global aspects of the data. High numbers will tend to favor local data.

False

9. Assume that you have 2 continuous variables in your data set, how many Principal Components will be created if you do a PCA Analysis?

2

10. Principal Components are always Orthogonal to one another.

True

11. In the R programming language, the “Rtsne” function allows for scoring data using the “predict” command. 

False

12. In tSNE analysis, the vectors represent a LINEAR relationship in the data.

False

13. Assume that you have 8 continuous variables in your data set, how many Principal Components will be created if you do a tSNE Analysis using Rtsne?

2 or 3

14. In tSNE analysis, the vectors represent a NON LINEAR relationship in the data.

True

15. When doing tSNE analysis, setting the Perplexity to a high number will tend to have less well defined groupings.

False

16. In the R programming language, the “prcomp” function allows for scoring data using the “predict” command. 

True

17. In PCA analysis, the vectors represent a NON LINEAR relationship in the data.

False

18. In PCA analysis, the vectors represent a LINEAR relationship in the data.

True

19. When doing tSNE analysis, setting the Perplexity to a high number will tend to have more well defined groupings.

True

20. Assume that you have 8 continuous variables in your data set, how many Principal Components will be created if you do a PCA Analysis?

8

21. Given the following Scree Plot, how many Principal Components should be used?

1 or possibly 2 Principal Components

22. Principal Components are always Independent to one another.

True
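Several of the PCA answers above can be verified with `prcomp()` from base R. The sketch below uses the four numeric columns of the built-in iris data as a stand-in for "continuous variables"; centering and scaling are choices made for this example. tSNE is not run here because Rtsne is a separate package.

```r
# Four continuous variables from the built-in iris data
X <- iris[, 1:4]
pca <- prcomp(X, center = TRUE, scale. = TRUE)

# One Principal Component is created per input variable
n_components <- ncol(pca$x)
print(n_components)

# Component scores are uncorrelated even though the inputs are correlated
pc_correlation <- cor(pca$x)
print(round(pc_correlation, 10))

# The loading vectors are orthogonal: their cross product is the identity matrix
loading_crossprod <- crossprod(pca$rotation)
print(round(loading_crossprod, 10))

# prcomp objects support predict(), so new records can be scored
scores <- predict(pca, newdata = X[1:5, ])
print(scores)

# Scree plot of the variance explained by each component
screeplot(pca, type = "lines", main = "Scree Plot")
```

A common reading of the scree plot is to keep the components before the curve flattens out.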
