Given a group of values, the entropy of the group is defined by the following formula:

H(L) = - sum over x of P(x) * log2(P(x))

where P(x) is the probability of appearance of the value x in the group.
For example, given the input group [1, 1, 2, 2]:
the probability of value 1 is 2/4 = 1/2
the probability of value 2 is 2/4 = 1/2
Therefore, its entropy is: - (1/2) * log2(1/2) - (1/2) * log2(1/2) = 1/2 + 1/2 = 1
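The entropy formula above can be sketched directly in Python. The helper name `entropy` is illustrative and not part of the solution code later in this article:

```python
from collections import Counter
from math import log2

def entropy(values):
    """Shannon entropy (in bits) of a group of values."""
    total = len(values)
    counts = Counter(values)
    return -sum((c / total) * log2(c / total) for c in counts.values())

print(entropy([1, 1, 2, 2]))  # 1.0
```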
This exercise, however, asks for the maximum information gain that one can obtain by splitting a group into two subgroups. The information gain is the difference in entropy before and after the split. For a group L divided into subgroups {L1, L2}, the information gain is calculated as follows:

IG = H(L) - ( |L1| / |L| * H(L1) + |L2| / |L| * H(L2) )

The overall entropy of the split subgroups {L1, L2} is the sum of the entropy of each subgroup, weighted by its proportion of the original group.
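As a concrete check of the formula, here is a minimal sketch (the helper names `entropy` and `info_gain` are illustrative) that splits the earlier group [1, 1, 2, 2] into [1, 1] and [2, 2]:

```python
from collections import Counter
from math import log2

def entropy(values):
    """Shannon entropy (in bits) of a group of values."""
    total = len(values)
    return -sum((c / total) * log2(c / total) for c in Counter(values).values())

def info_gain(group, left, right):
    """H(L) minus the size-weighted entropies of the two subgroups."""
    n = len(group)
    return entropy(group) - (len(left) / n * entropy(left)
                             + len(right) / n * entropy(right))

print(info_gain([1, 1, 2, 2], [1, 1], [2, 2]))  # 1.0
```

Each subgroup is pure, so its entropy is 0 and the full entropy of 1 bit is gained.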
In this exercise, one is given a list of samples of Iris flowers. Each sample is represented as a tuple of two values: <petal_length, species>, where the first attribute is the measured length of the sample's petal and the second attribute indicates the species of the sample. Here is an example:

[(0.5, 'setosa'), (1.0, 'setosa'), (1.5, 'versicolor'), (2.3, 'versicolor')]
The task is to split the sample list into two sublists while complying with the following two conditions:

1. The petal_length of every sample in one sublist is less than or equal to that of every sample in the other sublist.
2. The information gain with respect to the species attribute of the sublists is maximal among all possible splits.

As output, one should just return the information gain. In addition, one can expect that each value of petal_length is unique.
In the above example, the optimal split would be L1 = [(0.5, 'setosa'), (1.0, 'setosa')] and L2 = [(1.5, 'versicolor'), (2.3, 'versicolor')]. According to the above formulas, the maximum information gain for this example is 1.0.
Note: Certain languages (e.g. Java) have no built-in tuple type. As a result, to make the input more general, we decompose it into two lists, [petal_length] and [species], where each petal_length is a float and each species is a string. The elements of the petal_length list and the species list are associated with each other one-to-one by order.
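In this decomposed format, the worked example above becomes two parallel lists:

```python
# Parallel-list form of the worked example:
# index i pairs petal_length[i] with species[i].
petal_length = [0.5, 1.0, 1.5, 2.3]
species = ["setosa", "setosa", "versicolor", "versicolor"]
```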
from typing import List
import collections
import math
from math import log2


class Solution:
    def calculateMaxInfoGain(self, petal_length: List[float], species: List[str]) -> float:
        if not petal_length or not species:
            return 0
        n = len(petal_length)
        # Sort samples by petal length so that every split index yields a
        # partition where all lengths on one side are <= those on the other.
        pairs = sorted(zip(petal_length, species))
        petal_length, species = zip(*pairs)
        # Track the minimum weighted entropy over all split points.
        Hs = math.inf
        for i in range(1, n):
            H1 = self.calculateEntropy(species[:i])
            H2 = self.calculateEntropy(species[i:])
            Hs = min(Hs, H1 * i / n + H2 * (n - i) / n)
        if Hs == math.inf:
            # Only one sample: no split is possible.
            return 0
        Hl = self.calculateEntropy(species)
        return Hl - Hs

    def calculateEntropy(self, species: List[str]) -> float:
        size = len(species)
        probs = collections.defaultdict(int)
        for sp in species:
            probs[sp] += 1 / size
        if len(probs) <= 1:
            # A single distinct value has zero entropy.
            return 0
        return -sum(p * log2(p) for p in probs.values())
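As a quick sanity check, the same algorithm can be written as a standalone sketch (the function name `max_info_gain` is illustrative, independent of the class above) and run on the worked example, where it returns 1.0:

```python
from collections import Counter
from math import log2

def max_info_gain(petal_length, species):
    """Maximum information gain over all length-ordered splits."""
    def entropy(labels):
        total = len(labels)
        return -sum((c / total) * log2(c / total) for c in Counter(labels).values())

    n = len(species)
    # Sort the labels by petal length, then try every split boundary.
    labels = [s for _, s in sorted(zip(petal_length, species))]
    base = entropy(labels)
    return max(
        (base - (i / n) * entropy(labels[:i]) - ((n - i) / n) * entropy(labels[i:])
         for i in range(1, n)),
        default=0.0,
    )

print(max_info_gain([0.5, 1.0, 1.5, 2.3],
                    ["setosa", "setosa", "versicolor", "versicolor"]))  # 1.0
```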
#include <vector>
#include <string>
#include <unordered_map>
#include <cmath>
#include <algorithm>
using namespace std;

class Solution {
public:
    // Shannon entropy (base 2) of a list of labels.
    double calEntropy(vector<string>& species)
    {
        unordered_map<string, int> shown;
        double total = species.size();
        for(const string& specie : species)
        {
            ++shown[specie];
        }
        double entropy = 0.0;
        for(auto& n : shown)
        {
            double p = n.second / total;
            entropy -= p * log(p) / log(2.0);
        }
        return entropy;
    }

    double calculateMaxInfoGain(vector<double>& petal_length, vector<string>& species) {
        // Avoid edge case
        if(petal_length.empty() || species.empty()) return 0.0;
        int s = petal_length.size();
        double all_entropy = calEntropy(species);
        double information_gain = 0.0;
        // Try each petal length as the split threshold
        for(int i = 0; i < s; ++i)
        {
            // Partition the labels around the threshold
            double split = petal_length[i];
            vector<string> l1;
            vector<string> l2;
            for(int j = 0; j < s; ++j)
            {
                if(petal_length[j] < split)
                {
                    l1.push_back(species[j]);
                }
                else
                {
                    l2.push_back(species[j]);
                }
            }
            // Information gain = total entropy minus weighted subgroup entropies
            double l1_entropy = calEntropy(l1);
            double l2_entropy = calEntropy(l2);
            double p1 = (double)l1.size() / s;
            double p2 = (double)l2.size() / s;
            information_gain = max(information_gain, all_entropy - l1_entropy * p1 - l2_entropy * p2);
        }
        return information_gain;
    }
};
In our experience, we suggest you work through this Calculate Maximum Information Gain LeetCode Solution and gain some new skills from professionals, completely free; we assure you it will be worth it.
If you are stuck anywhere on a coding problem, just visit Queslers to get the Calculate Maximum Information Gain LeetCode Solution.
I hope this Calculate Maximum Information Gain LeetCode Solution is useful for learning something new from this problem. If it helped you, then don't forget to bookmark our site for more Coding Solutions.
This problem is intended for audiences of all experience levels who are interested in learning about Data Science in a business context; there are no prerequisites.
Keep Learning!