WEEK-2
code
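# Install once, then load dplyr for the pipe and grouping verbs
install.packages("dplyr")
library(dplyr)
# Read the insurance data, then inspect its structure and summary statistics
Rajeshdf = read.csv('c:\\Insurance.csv')
str(Rajeshdf)
summary(Rajeshdf)
# Count the records in each JOB category
agg_tbl <- Rajeshdf %>% group_by(JOB) %>%
  summarise(total_count = n(),
            .groups = 'drop')
agg_tbl
# Median HOME_VAL by CAR_TYPE, ignoring missing values
a = aggregate(x = Rajeshdf$HOME_VAL, by = list(Rajeshdf$CAR_TYPE), FUN = median, na.rm = TRUE)
a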
QUIZ
1. What famous literary detective solved a crime because a dog did not bark at the criminal?
Sherlock Holmes
2. In the Insurance data set, how many Lawyers are there?
1031
3. What two prefixes does the instructor use for variables when fixing the missing values? Select all that apply.
IMP_
M_
4. What is the median Home Value of a person who drives a Van?
204139
5. In the insurance data set, how many missing (NA) values does the variable AGE have?
7
6. What is the process called where missing data is fixed?
Imputing
7. According to the instructor, approximately what percentage of the analytic time is spent on data preparation?
90%
8. In the Insurance data set, how many Blue Collar workers are there?
2288
9. What is the median Home Value of a person who drives a Panel Truck?
220541
10. In the insurance data set, how many missing (NA) values does the variable KIDSDRIV have?
0
11. In the Insurance data set, how many Doctors are there?
321
12. What is the median Home Value of a person who drives a Pickup?
151061
13. In the insurance data set, how many missing (NA) values does the variable AGE have?
7
14. What is the process called that converts categorical variables into flag variables?
One Hot Encoding
15. In the insurance data set, how many missing (NA) values does the variable KIDSDRIV have?
0
16. In the R programming language, what is one method for converting a TRUE/FALSE variable into a 1/0 variable?
Add the number zero (0) to the TRUE/FALSE variable.
17. What is the median Home Value of a person who drives an SUV?
140927
18. According to the instructor, after a variable with missing values is “fixed”, it is a good idea to remove the variable from the data set.
True
19. What is the median Home Value of a person who drives a Minivan?
172269
20. In the insurance data set, how many missing (NA) values does the variable YOJ have?
548
21. In the Insurance data set, how many Home Makers are there?
843
22. In the Insurance data set, how many Clerical workers are there?
1590
23. In the insurance data set, how many missing (NA) values does the variable CAR_AGE have?
639
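The missing-value and flag-variable answers above can be sketched in a few lines of R. This is only an illustration on a small made-up data frame (the values, and the `toy` name, are hypothetical and are not the course's Insurance data). Median imputation is an assumption here; the notes only say that missing data is "imputed" and that the instructor uses the `IMP_` and `M_` prefixes.

```r
# Hypothetical toy data standing in for a few Insurance columns
toy <- data.frame(
  AGE = c(45, NA, 38, 52, NA),
  JOB = c("Lawyer", "Doctor", "Clerical", "Lawyer", "Doctor"),
  stringsAsFactors = FALSE
)

# Count the missing (NA) values in every column
na_counts <- colSums(is.na(toy))
na_counts

# M_ flag: is.na() returns TRUE/FALSE, and adding 0 converts it to 1/0
toy$M_AGE <- is.na(toy$AGE) + 0

# IMP_ variable: fill the missing values (median is an assumed choice)
toy$IMP_AGE <- ifelse(is.na(toy$AGE), median(toy$AGE, na.rm = TRUE), toy$AGE)

# One hot encoding: one 1/0 flag variable per JOB category
for (job in unique(toy$JOB)) {
  toy[[paste0("JOB_", job)]] <- (toy$JOB == job) + 0
}
toy
```

On the real data, the same `colSums(is.na(...))` call would be applied to the data frame read from Insurance.csv.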
WEEK 5 QUIZ
1. Random Forests and the Gradient Boosting models will usually be more accurate than Decision Tree models.
True
2. Which of these modelling techniques is not adversely affected by outliers?
All of these
3. Gradient Boosting models are easy to interpret.
False
4. Which of these modelling techniques trains many trees, with each tree built on a random subset of variables?
Random Forests
5. Which of these modelling techniques tends to use many small trees?
Gradient Boosting
6. Which of these modelling techniques is usually the easiest to interpret?
Decision Trees
7. Random Forests are easy to interpret.
False
8. In the United States, it is probably against the law to use a Gradient Boosting model for Marketing models.
False
9. Gradient Boosting models are based on Decision Trees.
True
10. Which of these modelling techniques is usually the fastest to train?
Decision Trees
11. Random Forests and the Gradient Boosting models will always be more accurate than Decision Tree models.
False
12. A Random Forest is more sensitive to a small input change than a Decision Tree.
False
13. Which of these modelling techniques trains many trees, with each tree built on a random subset of records?
Random Forests
14. Random Forests are based on Decision Trees.
True
15. A Gradient Boosting model is less sensitive to a small input change than a Decision Tree.
True
16. In the United States, it may be against the law to use a Gradient Boosting model for Credit or Auto Insurance models.
True
17. Which of these modelling techniques alters the data in order to over sample records that it incorrectly classified?
Gradient Boosting
18. Which of these modelling techniques is usually the easiest to convert into IF-THEN-ELSE rules?
Decision Trees
19. In the United States, it may be against the law to use a Random Forest for Credit or Auto Insurance models.
True
20. In the United States, it is probably against the law to use a Random Forest for Marketing models.
False
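As a rough illustration of why a single Decision Tree is the easiest of these models to read as IF-THEN-ELSE rules, the sketch below fits one tree to the built-in iris data. It assumes the `rpart` package is available (it is normally bundled with R as a recommended package). The choice of iris and of `rpart` is mine and is not taken from the course material.

```r
# Assumes the rpart package is installed (usually shipped with R)
library(rpart)

# Fit a single classification tree to the built-in iris data
fit <- rpart(Species ~ ., data = iris, method = "class")

# Printing the fit lists each split, which reads like IF-THEN-ELSE rules
print(fit)

# Score the training data with the fitted tree
pred <- predict(fit, newdata = iris, type = "class")
table(predicted = pred, actual = iris$Species)
```

Random Forests and Gradient Boosting models combine many such trees, which is why they are harder to interpret.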
1. When doing tSNE analysis, setting the Perplexity to a low number will tend to favor local aspects of the data. High numbers will tend to favor global data.
True
2. Principal Components are always Orthogonal to one another.
True
3. When doing tSNE analysis, setting the Perplexity to a high number will tend to have less well-defined groupings.
False
4. When doing tSNE analysis, setting the Perplexity to a high number will tend to have more well-defined groupings.
True
5. In PCA analysis, the vectors represent a LINEAR relationship in the data.
True
6. Assume that you have 3 continuous variables in your dataset, how many Principal Components will be created if you do a PCA Analysis?
3
7. Principal Components are always Independent to one another.
True
8. In the R programming language, the “prcomp” function allows for scoring data using the “predict” command.
True
9. Assume that you have 8 continuous variables in your data set, how many Principal Components will be created if you do a PCA Analysis?
8
10. When doing tSNE analysis, setting the Perplexity to a low number will tend to favor global aspects of the data. High numbers will tend to favor local data.
False
11. Assume that you have 3 continuous variables in your dataset, how many Principal Components will be created if you do a PCA Analysis?
3
12) When doing tSNE analysis, setting the Perplexity to a low number will tend to favor global aspects of the data. High numbers will tend to favor local data.
False
13) When doing tSNE analysis, setting the Perplexity to a high number will not have more well-defined groupings.
False
14) Assume that you have 2 continuous variables in your dataset, how many Principal Components will be created if you do a PCA Analysis?
2
15) tSNE vectors are always Orthogonal to one another.
False
16) tSNE vectors are always Orthogonal to one another.
False
17) Assume that you have 8 continuous variables in your dataset, how many Principal Components will be created if you do a PCA Analysis?
8
18) Assume that you have 8 continuous variables in your dataset, how many Principal Components will be created if you do a tSNE Analysis using Rtsne?
2 or 3
19) In PCA analysis, the vectors represent a NON LINEAR relationship in the data.
False
20) Assume that an input data set has four variables: A, B, C, D and they are used to create four Principal Components: PC1, PC2, PC3, and PC4. If A, B, C, D are all highly correlated, then what do you know about the correlation of PC1, PC2, PC3, and PC4?
PC1, PC2, PC3, and PC4 are completely uncorrelated from one another.
21) In tSNE analysis, the vectors represent a LINEAR relationship in the data.
False
22) Given the following Scree Plot, how many Principal Components should be used?
2 or possibly 3 Principal Components
23) In the R programming language, the “Rtsne” function allows for scoring data using the “predict” command.
False
24. To answer this question, please refer to the CRAN Packages web page referred to in the course material.
Which of these packages are used for Optical Character Recognition?
abbyyR
25. Using the iris data set in R, generate a box plot by Species of the variable Petal Length.

26. Using the iris data set in R, generate a box plot by Species of the variable Petal Width.

27. Using the iris data set in R, generate a box plot by Species of the variable Sepal Length.
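
The three box plot questions above have no recorded answer in these notes. One possible way to produce them with base R graphics is sketched below; the titles and the `bp_*` variable names are my own choices.

```r
# iris is pre-loaded in R; the formula y ~ group draws one box per Species
bp_petal_length <- boxplot(Petal.Length ~ Species, data = iris,
                           main = "Petal Length by Species",
                           ylab = "Petal.Length")

bp_petal_width <- boxplot(Petal.Width ~ Species, data = iris,
                          main = "Petal Width by Species",
                          ylab = "Petal.Width")

bp_sepal_length <- boxplot(Sepal.Length ~ Species, data = iris,
                           main = "Sepal Length by Species",
                           ylab = "Sepal.Length")
```

Each call also returns the plotted statistics invisibly, so the assigned objects can be inspected afterwards (for example `bp_petal_length$stats`).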

1. What are the two commands that will return the first and last six rows of a Data Frame?
head, tail
2. The R programming language has data sets that are pre-loaded. One of these data sets is the “iris” data set. What command will give you information about this data set?
iris
3. In the R data set, ChickWeight, calculate the median weight value by Diet. What is the median weight of a chick that has Diet=1 ?
88.0
4. In the R data set, ChickWeight, calculate the median weight value by Diet. What is the median weight of a chick that has Diet=2 ?
104.5
5. In the R data set, ChickWeight, calculate the median weight value by Diet. What is the median weight of a chick that has Diet=3 ?
125.5
6. In the R data set, ChickWeight, calculate the median weight value by Diet. What is the median weight of a chick that has Diet=4 ?
129.5
7. To answer this question, please refer to the CRAN Packages web page referred to in the course material.
Which of these packages are used for Reliability and Scoring Routines?
AATtools
8. How many records are in the predefined data set named “trees”
31
9. There is no guarantee that an R Package included in CRAN will be maintained and “up to date”.
True
10. Which of these packages are used for Combining Multidimensional Arrays?
abind
11. How many records are in the predefined data set named “cars”
50
12. Which of these packages are used for Bayesian approximation?
abc
13. If an R Package is included in CRAN it is guaranteed to be regularly updated, and will always be “up to date”.
False
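Several of the answers in this list can be checked directly in an R session, since `iris`, `ChickWeight`, `trees`, and `cars` are all pre-loaded data sets. The following is a minimal sketch; the variable names (`first_rows`, `med_by_diet`, and so on) are my own.

```r
# First and last six rows of a data frame
first_rows <- head(iris)
last_rows <- tail(iris)

# Number of records in the predefined data sets
n_trees <- nrow(trees)
n_cars <- nrow(cars)
n_trees
n_cars

# Median chick weight for each Diet group
med_by_diet <- aggregate(weight ~ Diet, data = ChickWeight, FUN = median)
med_by_diet
```

The `med_by_diet` table has one row per Diet value, which can be compared against the median answers recorded above.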
WEEK-7 QUIZ
1. When doing tSNE analysis, setting the Perplexity to a low number will tend to favor local aspects of the data. High numbers will tend to favor global data.
True
2. In the R programming language, the “Rtsne” function allows for scoring data using the “predict” command.
False
3. Assume that an input data set has four variables: A,B,C,D and they are used to create four Principal Components: PC1, PC2, PC3, and PC4. If A,B,C,D are all highly correlated, then what do you know about the correlation of PC1, PC2, PC3, and PC4?
PC1, PC2, PC3, and PC4 are completely uncorrelated from one another.
4. tSNE vectors are always Independent to one another.
False
5. Assume that you have 3 continuous variables in your data set, how many Principal Components will be created if you do a PCA Analysis?
3
6. Principal Components are always Orthogonal to one another
True
7. tSNE vectors are always Orthogonal to one another.
False
8. When doing tSNE analysis, setting the Perplexity to a low number will tend to favor global aspects of the data. High numbers will tend to favor local data.
False
9. Assume that you have 2 continuous variables in your data set, how many Principal Components will be created if you do a PCA Analysis?
2
10. Principal Components are always Orthogonal to one another.
True
11. In the R programming language, the “Rtsne” function allows for scoring data using the “predict” command.
False
12. In tSNE analysis, the vectors represent a LINEAR relationship in the data.
False
13. Assume that you have 8 continuous variables in your data set, how many Principal Components will be created if you do a tSNE Analysis using Rtsne?
2 or 3
14. In tSNE analysis, the vectors represent a NON LINEAR relationship in the data.
True
15. When doing tSNE analysis, setting the Perplexity to a high number will tend to have less well defined groupings.
False
16. In the R programming language, the “prcomp” function allows for scoring data using the “predict” command.
True
17. In PCA analysis, the vectors represent a NON LINEAR relationship in the data.
False
18. In PCA analysis, the vectors represent a LINEAR relationship in the data.
True
19. When doing tSNE analysis, setting the Perplexity to a high number will tend to have more well defined groupings.
True
20. Assume that you have 8 continuous variables in your data set, how many Principal Components will be created if you do a PCA Analysis?
8
21. Given the following Scree Plot, how many Principal Components should be used?
1 or possibly 2 Principal Components
22. Principal Components are always Independent to one another.
True
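The PCA answers above (as many Principal Components as input variables, components that are orthogonal and uncorrelated, and scoring with `predict`) can be checked with `prcomp` from base R. This sketch uses the four numeric iris columns as an assumed example data set; the course may have used different data.

```r
# Four continuous variables from the built-in iris data
X <- iris[, 1:4]

# PCA on centered and scaled inputs
pca <- prcomp(X, center = TRUE, scale. = TRUE)

# One Principal Component is created per input variable
n_components <- ncol(pca$x)
n_components

# The component scores are uncorrelated with one another
pc_cor <- cor(pca$x)
round(pc_cor, 10)

# The loading vectors are orthogonal: their cross product is the identity
loadings_check <- crossprod(pca$rotation)
round(loadings_check, 10)

# prcomp objects support predict(), so new records can be scored
scores_new <- predict(pca, newdata = X[1:5, ])

# Variance explained per component; screeplot() draws the Scree Plot
summary(pca)
screeplot(pca, type = "lines")
```

Rtsne, by contrast, returns only the embedded coordinates for the data it was run on, which is consistent with the answer that it cannot be used with `predict`.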
