Ready to move past Excel for complex business analysis? Then you’ll find this course very helpful.
This hands-on introductory Data Science course is aimed at professionals and students who don’t have any experience with programming. It will help you advance your career by preparing you to conduct meaningful data analysis in Python on any dataset — large or small.
You’ll begin with the fundamentals of Python, with a focus on working with CSV files, covering concepts like data preprocessing and Exploratory Data Analysis (EDA). In the second half, you’ll focus on predictive and inferential analysis using statistical and machine learning techniques, and learn how these techniques can help solve business problems.
def average(input_list):
    sum_list = 0
    for i in input_list:
        sum_list = sum_list + i
    avg = sum_list/len(input_list)
    return avg

def factorial(n):
    if n==0 or n==1:
        return 1
    if n < 1:
        return -1
    product = 1
    while(n > 1):
        product = product * n
        n = n-1
    return product
Q1. A Dataframe is a 2-Dimensional object to store tabular data.
Q2. Suppose we have a Gender column in our dataframe (df) which has the values Male and Female. Which of these will give us a filtered dataframe of males? Select all answers you think are correct.
df = df['Male']
df = df[df['Male']]
condition = df['Gender'] == 'Male'; df = df[condition]
df = df[ df['Gender'] == 'Male']
condition = df['Gender'] != 'Female'; df = df[condition]
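The boolean-mask options above can be tried on a small frame. This is a minimal sketch, assuming a made-up `df` whose `Gender` column holds only `Male` and `Female`; the data is illustrative and not taken from the course.

```python
import pandas as pd

# Hypothetical toy data, not from the course dataset
df = pd.DataFrame({
    "Gender": ["Male", "Female", "Male", "Female"],
    "Age": [23, 31, 45, 52],
})

# Build a boolean Series first, then use it to select rows
condition = df["Gender"] == "Male"
males = df[condition]

# The same filter written in one step
males_inline = df[df["Gender"] == "Male"]

# Excluding 'Female' matches only because the column has exactly two values
males_by_exclusion = df[df["Gender"] != "Female"]

print(males)
```

Note that `df['Male']` looks up a column named `Male`, which this toy frame does not have, so it is not a row filter.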
Q3. Which of these can be used to set the value of the first cell in the Age column to 23 if Age is the first column in the dataset? Select all answers you think are correct.
df[0,'Age'] = 23
df.loc[0,'Age'] = 23
df.iloc[0,'Age'] = 23
df.iloc[0,0] = 23
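The difference between label-based and position-based assignment can be seen on a toy frame. The sketch below assumes a hypothetical `df` with `Age` as the first column and the default integer index; it is an illustration, not the course's dataset.

```python
import pandas as pd

# Hypothetical toy data: Age is the first column, default 0..n-1 index
df = pd.DataFrame({"Age": [40, 31, 52], "Gender": ["Male", "Female", "Male"]})

# Label-based: row label 0, column label "Age"
by_label = df.copy()
by_label.loc[0, "Age"] = 23

# Position-based: first row, first column
by_position = df.copy()
by_position.iloc[0, 0] = 23

# Plain brackets treat (0, "Age") as a column name, so this adds a new
# column filled with 23 instead of changing the first Age value
by_brackets = df.copy()
by_brackets[0, "Age"] = 23

print(by_label["Age"].tolist())
print(by_position["Age"].tolist())
print(by_brackets["Age"].tolist())
```

`loc` expects labels and `iloc` expects integer positions, which is why mixing a position with a column label inside `iloc` is not a valid way to address the cell.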
Q4. Which of the following are aggregation functions, i.e., functions that take in a series and return a single value? Select all answers you think are correct.
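The answer options for this question did not survive on this page, so the functions below are chosen only to illustrate the definition. A minimal sketch with made-up values, showing that an aggregation reduces a Series to one value while a transformation keeps its length:

```python
import pandas as pd

# Hypothetical values for illustration
ages = pd.Series([23, 31, 45, 52])

# Aggregations: a whole Series in, a single value out
total = ages.sum()
average = ages.mean()
largest = ages.max()

# A transformation, by contrast, returns a Series of the same length
doubled = ages * 2

print(total, average, largest)
print(len(doubled))
```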
Q5. The apply function is used to apply custom functions to the data.
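As a sketch of what applying a custom function looks like, the example below maps each age to a label with `Series.apply`. The `age_group` helper, its thresholds, and the data are hypothetical and only serve the illustration.

```python
import pandas as pd

# Hypothetical toy data
df = pd.DataFrame({"Age": [15, 31, 67]})

def age_group(age):
    """Hypothetical custom function: map an age to a coarse label."""
    if age < 18:
        return "minor"
    if age < 65:
        return "adult"
    return "senior"

# apply calls age_group once per value in the column
df["AgeGroup"] = df["Age"].apply(age_group)

print(df)
```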
Q6. We can NOT group data by more than one variable.
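Whether grouping works with several variables is easy to check: `groupby` accepts a list of column names. A minimal sketch on made-up data (the column names `Gender`, `Married`, and `Spend` are assumptions for illustration):

```python
import pandas as pd

# Hypothetical toy data
df = pd.DataFrame({
    "Gender": ["Male", "Female", "Male", "Female", "Male"],
    "Married": ["Yes", "No", "Yes", "Yes", "No"],
    "Spend": [10, 20, 30, 40, 50],
})

# Group by two variables at once; the result is indexed by (Gender, Married)
totals = df.groupby(["Gender", "Married"])["Spend"].sum()

print(totals)
```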
pivot_table are used for summarizing data.
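A short sketch of summarizing with `pivot_table`, again on made-up data with hypothetical column names. It places one variable on the rows, another on the columns, and the mean of a third in the cells.

```python
import pandas as pd

# Hypothetical toy data
df = pd.DataFrame({
    "Gender": ["Male", "Female", "Male", "Female", "Male"],
    "Married": ["Yes", "No", "Yes", "Yes", "No"],
    "Spend": [10, 20, 30, 40, 50],
})

# Rows: Gender, columns: Married, cells: mean Spend for each combination
table = pd.pivot_table(
    df, values="Spend", index="Gender", columns="Married", aggfunc="mean"
)

print(table)
```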
df.plot(kind = 'box',subplots = True, sharex=False, sharey = False)
In the above use of the plot function, subplots=True tells the function to arrange all boxplots in rows and columns inside a group of plots.
def change_values(df):
    condition = df['BOROUGH'] == 1
    df.loc[condition,'BOROUGH'] = 'Manhattan'
    condition = df['BOROUGH'] == 2
    df.loc[condition,'BOROUGH'] = 'Bronx'
    condition = df['BOROUGH'] == 3
    df.loc[condition,'BOROUGH'] = 'Brooklyn'
    condition = df['BOROUGH'] == 4
    df.loc[condition,'BOROUGH'] = 'Queens'
    condition = df['BOROUGH'] == 5
    df.loc[condition,'BOROUGH'] = 'Staten Island'
    return df

def remove_missing(df):
    present = df['SALE PRICE'].notnull()
    df = df[present]
    return df

def remove_duplicates(df):
    df = df.drop_duplicates(subset=df.columns)
    return df

def remove_outliers(df):
    # Retrieve only outlier columns
    new_df = df[['RESIDENTIAL UNITS', 'COMMERCIAL UNITS','TOTAL UNITS',
                 'LAND SQUARE FEET','GROSS SQUARE FEET','YEAR BUILT']]
    # find max and min using IQR
    Q1 = new_df.quantile(0.10)
    Q3 = new_df.quantile(0.90)
    IQR = Q3-Q1
    minimum = Q1 - 1.5*IQR
    maximum = Q3 + 1.5*IQR
    # condition on which to filter
    condition = (new_df <= maximum) & (new_df >= minimum)
    condition = condition.all(axis=1)
    # Filter rows that have outliers
    df = df[condition]
    return df
Q1. What is the mean of
Q2. How many times do LIMIT_BAL values appear in the interval (100000.0, 200000.0]?
Q3. What is the 75th percentile of
Q4. What is the skew value of
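The four questions above ask for a mean, a count within an interval, a percentile, and a skew value. The sketch below shows one way to compute each in pandas, assuming a made-up `LIMIT_BAL` Series and hypothetical bin edges; the numbers are illustrative and are not the quiz answers.

```python
import pandas as pd

# Hypothetical values, not the course dataset
limit_bal = pd.Series(
    [20000, 120000, 150000, 200000, 250000, 100000], name="LIMIT_BAL"
)

mean_value = limit_bal.mean()
percentile_75 = limit_bal.quantile(0.75)
skew_value = limit_bal.skew()

# Bin into right-closed intervals such as (100000, 200000] and count each bin
binned = pd.cut(limit_bal, bins=[0, 100000, 200000, 300000])
counts = binned.value_counts().sort_index()

# The same count for one interval, written as an explicit condition
in_interval = int(((limit_bal > 100000) & (limit_bal <= 200000)).sum())

print(mean_value, percentile_75, skew_value)
print(counts)
```

Because the intervals are closed on the right, a value of exactly 200000 falls in (100000, 200000] while 100000 does not.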
Q1. How many married persons have defaulted in our dataset?
Q2. How many single persons have NOT defaulted in our dataset?
Q3. What is the probability of a married person defaulting next month?
Q4. A single person is more likely to default the next month than a married person in our dataset.
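Counts and default probabilities per marital status can be read from a cross-tabulation. A minimal sketch, assuming hypothetical column names `MARRIAGE` and `default` and made-up rows; the real dataset's column names and encodings may differ.

```python
import pandas as pd

# Hypothetical toy data: default is 1 if the person defaulted, else 0
df = pd.DataFrame({
    "MARRIAGE": ["married"] * 4 + ["single"] * 5,
    "default": [1, 0, 0, 0, 1, 1, 0, 0, 0],
})

# Rows: marital status, columns: default flag, cells: number of people
counts = pd.crosstab(df["MARRIAGE"], df["default"])
married_defaults = counts.loc["married", 1]
single_non_defaults = counts.loc["single", 0]

# The mean of a 0/1 flag within a group is the share of that group that defaulted
p_default = df.groupby("MARRIAGE")["default"].mean()

print(counts)
print(p_default)
```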
Q1. How many people lie in the interval (0, 100000] of LIMIT_BAL who have defaulted?
Q2. What is the probability of people defaulting who get LIMIT_BAL in the interval (100000, 200000]?
Q3. As the LIMIT_BAL given to a person increases, the probability of the person defaulting decreases.
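One way to approach these questions is to bin `LIMIT_BAL` and then aggregate the default flag within each bin. The sketch below assumes a hypothetical 0/1 `default` column, made-up rows, and two illustrative bins.

```python
import pandas as pd

# Hypothetical toy data, not the course dataset
df = pd.DataFrame({
    "LIMIT_BAL": [50000, 80000, 90000, 100000, 150000, 180000, 200000, 120000],
    "default": [1, 1, 0, 0, 1, 0, 0, 0],
})

# Right-closed bins: (0, 100000] and (100000, 200000]
df["bin"] = pd.cut(df["LIMIT_BAL"], bins=[0, 100000, 200000])

grouped = df.groupby("bin", observed=True)["default"]
defaults_per_bin = grouped.sum()   # number of defaulters in each bin
default_rate = grouped.mean()      # share of defaulters in each bin

print(defaults_per_bin)
print(default_rate)
```

Comparing `default_rate` across bins ordered by credit limit is one way to check whether the default probability falls as the limit rises.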
def exercise_1(df):
    temp = df.groupby('CustomerID').size()
    temp = temp.sort_values(ascending=False)
    temp = temp.iloc[:5]
    return temp

def exercise_2(df):
    temp = df.groupby('CustomerID').sum()
    temp = temp['AmountSpent']
    temp = temp.sort_values(ascending=False)
    temp = temp.iloc[:5]
    return temp

def exercise_3(df):
    temp = df.groupby('Country').size()
    temp = temp.sort_values(ascending=False)
    temp = temp.iloc[:5]
    return temp

def exercise_4(df):
    condition = df['PurchaseYear'] == 2011
    temp = df[condition]
    temp = temp.groupby('PurchaseMonth').size()
    return temp

def exercise_5(df):
    temp = df.groupby('Description').sum()
    temp = temp['Quantity']
    temp = temp.sort_values(ascending=False)
    temp = temp.iloc[:10]
    return temp
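from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

def churn_predict_acc(X,Y,test_inputs,test_outputs):
    # Fit a logistic regression model on the training data
    lr = LogisticRegression()
    lr.fit(X,Y)
    # Predict on the test inputs and score against the true labels
    preds = lr.predict(test_inputs)
    acc = accuracy_score(y_true = test_outputs,y_pred = preds)
    return acc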
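To see the function run end to end, the sketch below repeats it so the block is self-contained and calls it on a tiny, clearly separable dataset. The data is made up for illustration and has nothing to do with the course's churn dataset.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

def churn_predict_acc(X, Y, test_inputs, test_outputs):
    lr = LogisticRegression()
    lr.fit(X, Y)
    preds = lr.predict(test_inputs)
    return accuracy_score(y_true=test_outputs, y_pred=preds)

# Hypothetical toy data: one feature, two well-separated classes
X = [[0], [1], [2], [3], [10], [11], [12], [13]]
Y = [0, 0, 0, 0, 1, 1, 1, 1]
test_inputs = [[1.5], [11.5]]
test_outputs = [0, 1]

acc = churn_predict_acc(X, Y, test_inputs, test_outputs)
print(acc)
```

On real data the accuracy would normally be measured on a held-out split rather than on two hand-picked points.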
Q1. Artificial Intelligence is a sub domain of Machine Learning.
Q2. Decision Trees capture non linear relationships between variables.
Q3. Linear Regression models can NOT capture non linear relationships.
Q4. Out of the following algorithms:
Which performs better?
Q5. Random Forest is a boosting algorithm.
Q6. In bagging, individual models train on data that is sampled _____.
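Bagging is short for bootstrap aggregating: each model is trained on a bootstrap sample, which is drawn with replacement from the training data. A minimal NumPy sketch, where `data` is a made-up stand-in for ten training rows and three models is an arbitrary choice:

```python
import numpy as np

rng = np.random.default_rng(seed=0)

# Hypothetical stand-in for 10 training rows
data = np.arange(10)
n_models = 3

# One bootstrap sample per model: same size as the data, drawn with replacement,
# so some rows can repeat and others can be left out
bootstrap_samples = [
    rng.choice(data, size=len(data), replace=True) for _ in range(n_models)
]

for sample in bootstrap_samples:
    print(sorted(sample.tolist()))
```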
Q7. Which of the following algorithms can be used for unsupervised learning? Check all answers that you think are correct.
Q8. PCA is used for
km = KMeans(n_clusters = 2)
result = km.predict(data)
In the above code, what is being stored in
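The snippet above does not show a fitting step, so the sketch below assumes one is needed and adds it. It uses made-up 2-D points in two well-separated groups; `n_init` and `random_state` are set only to make the run repeatable.

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical toy data: two well-separated groups of points
data = np.array([
    [0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
    [10.0, 10.0], [10.1, 10.2], [10.2, 10.1],
])

km = KMeans(n_clusters=2, n_init=10, random_state=0)
km.fit(data)  # assumed step: learn the cluster centers first

# predict returns one cluster label (0 or 1 here) for each row of data
result = km.predict(data)

print(result)
```

Which group receives label 0 and which receives label 1 is arbitrary; only the grouping itself is meaningful.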
Q10. Clustering can NOT be used to segment customer groups.
I hope these Data Science for Non-Programmers Educative quiz answers help you learn something new. If they helped, don’t forget to bookmark our site for more coding solutions.
This course is intended for audiences of all experience levels who are interested in learning about Data Science in a business context; there are no prerequisites.