import numpy as np
import pandas as pd

"""
load_data_set():
    Purpose: load the training data set.
    Input:   none.
    Returns: posting_list: the data set, one list per sample;
             property_list: the list of all values the discrete attributes can take.
"""
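def csv_input(CSV_FILE_PATH):
    # Read the CSV file (the first row is treated as the header)
    # and return the remaining rows as a list of lists.
    df = pd.read_csv(CSV_FILE_PATH)
    return df.values.tolist()
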
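def load_data_set():
    posting_list = csv_input('./西瓜的数据集.csv')
    # Every value a discrete attribute can take. train_naive_bayes treats
    # '硬滑' and '软粘' as the two values of a two-valued attribute and all
    # other entries as values of three-valued attributes.
    property_list = [
        '青绿', '乌黑', '浅白',
        '蜷缩', '稍缩', '硬挺',
        '浊响', '沉闷', '清脆',
        '清晰', '稍糊', '模糊',
        '凹陷', '稍陷', '稍凹',
        '硬滑', '软粘']
    return posting_list, property_list

"""
getNums(posting_list, col, rows, nums):
    Purpose: read several consecutive rows of one column of the data set
             (the density or sugar-content samples) as a one-dimensional list.
    Input:   posting_list: the data set; col: index of the column to read;
             rows: index of the first row to read; nums: number of rows to read.
    Output:  Nums: list of floats.
"""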
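def getNums(posting_list, col, rows, nums):
    # Convert nums consecutive entries of column col, starting at row rows, to float.
    Nums = [0] * nums
    for n in range(0, nums):
        Nums[n] = float(posting_list[rows + n][col])
    return Nums

"""
train_naive_bayes(posting_list, property_list):
    Purpose: train on the data set, i.e. compute the class-conditional
             probabilities and the class prior (the prior is computed but not returned).
    Input:   posting_list: the data set;
             property_list: the list of all values the discrete attributes can take.
    Output:  propertyConditionalProbabilityPositive: class-conditional probabilities
             for positive samples (good melons);
             propertyConditionalProbabilityNegative: class-conditional probabilities
             for negative samples.
"""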
def train_naive_bayes(posting_list, property_list):
    trainNum = len(posting_list)
    # Count the positive samples (the label '是' means "yes", a good melon).
    pSampleNum = 0
    for sample in posting_list:
        if sample[-1] == '是':
            pSampleNum += 1
    # Prior of the positive class; computed here but neither returned nor used.
    prioClass = pSampleNum / trainNum
    propertyConditionalProbabilityPositive = []
    propertyConditionalProbabilityNegative = []

    for propertyy in property_list:
        # Laplace smoothing: add the number of values the attribute can take
        # to the denominator (2 for '硬滑'/'软粘', 3 for every other value).
        if (propertyy == "硬滑") or (propertyy == "软粘"):
            pSampleNumLap = pSampleNum + 2
            nSampleNumLap = trainNum - pSampleNum + 2
        else:
            pSampleNumLap = pSampleNum + 3
            nSampleNumLap = trainNum - pSampleNum + 3

        # Laplace smoothing: every count starts at 1 instead of 0.
        posNumPropertyPositive = 1
        negNumPropertyPositive = 1

        for rows in range(0, len(posting_list)):
            if propertyy in posting_list[rows]:
                if posting_list[rows][-1] == '是':
                    posNumPropertyPositive += 1
                else:
                    negNumPropertyPositive += 1
        propertyConditionalProbabilityPositive.append(posNumPropertyPositive / pSampleNumLap)
        propertyConditionalProbabilityNegative.append(negNumPropertyPositive / nSampleNumLap)

    return propertyConditionalProbabilityPositive, propertyConditionalProbabilityNegative

"""
classify_naive_bayes(data, propertyConditionalProbabilityPositive, property_list, propertyConditionalProbabilityNegative):
    Purpose: compute the scores of the positive and negative classes and return 1 or 0.
    Input:   data: the sample to classify, one row whose last element is the label;
             propertyConditionalProbabilityPositive: positive class-conditional probabilities;
             property_list: the list of all values the discrete attributes can take;
             propertyConditionalProbabilityNegative: negative class-conditional probabilities.
    Output:  1 for a positive sample (good melon), 0 for a negative sample.
"""
def classify_naive_bayes(data, propertyConditionalProbabilityPositive, property_list, propertyConditionalProbabilityNegative):
    # Sum log-probabilities instead of multiplying probabilities to avoid underflow.
    # The class prior is not included in either score.
    probabilityPos = 0
    probabilityNeg = 0
    for propertyData in data[:-1]:  # the last element is the label
        # Values that are not in property_list are skipped.
        if propertyData in property_list:
            index = property_list.index(propertyData)
            probabilityPos += np.log(propertyConditionalProbabilityPositive[index])
            probabilityNeg += np.log(propertyConditionalProbabilityNegative[index])

    if probabilityPos > probabilityNeg:
        return 1
    else:
        return 0

if __name__ == "__main__":
    posting_list, property_list = load_data_set()
    propertyConditionalProbabilityPositive, propertyConditionalProbabilityNegative = train_naive_bayes(posting_list, property_list)

    data = csv_input('./西瓜的测试集.csv')
    accuracy = 0
    for n in range(0, len(data)):
        result = classify_naive_bayes(data[n], propertyConditionalProbabilityPositive, property_list, propertyConditionalProbabilityNegative)
        # Count a hit when the prediction matches the label ('是' = yes, '否' = no).
        if (result == 1 and data[n][-1] == '是') or (result == 0 and data[n][-1] == '否'):
            accuracy += 1
        print('is it the good melon : {} {}'.format(result, data[n][-1]))
    print('Accuracy: {}'.format(accuracy / len(data)))
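The listing computes `prioClass` in `train_naive_bayes` but leaves it out of the score in `classify_naive_bayes`. The block below is a minimal sketch of one way the log prior could be added to the summed log conditionals. It is not the original author's method. The four-row toy data set, the English attribute values, and the names `toy_rows`, `values_per_column`, `fit` and `predict` are all hypothetical, and smoothing the prior as well as the conditionals is an assumption of this sketch.

```python
import math

# Hypothetical toy data (not the original CSV): [color, texture, label]
toy_rows = [
    ["green", "clear", "yes"],
    ["black", "clear", "yes"],
    ["white", "blurry", "no"],
    ["green", "blurry", "no"],
]
# Values each attribute column can take; their count is the Laplace denominator term.
values_per_column = [["green", "black", "white"], ["clear", "blurry"]]


def fit(rows, values_per_column, labels=("yes", "no")):
    """Return (log_prior, log_cond) using Laplace-smoothed estimates."""
    log_prior, log_cond = {}, {}
    for label in labels:
        subset = [r for r in rows if r[-1] == label]
        # Smoothed prior (an assumption of this sketch): (count + 1) / (N + number of classes)
        log_prior[label] = math.log((len(subset) + 1) / (len(rows) + len(labels)))
        for col, values in enumerate(values_per_column):
            for v in values:
                count = sum(1 for r in subset if r[col] == v)
                # Smoothed conditional: (count + 1) / (class size + number of values)
                log_cond[(label, col, v)] = math.log(
                    (count + 1) / (len(subset) + len(values))
                )
    return log_prior, log_cond


def predict(sample, log_prior, log_cond):
    """Return the label with the largest log prior plus summed log conditionals."""
    scores = {
        label: lp + sum(log_cond[(label, col, v)] for col, v in enumerate(sample))
        for label, lp in log_prior.items()
    }
    return max(scores, key=scores.get)


log_prior, log_cond = fit(toy_rows, values_per_column)
print(predict(["black", "clear"], log_prior, log_cond))
```

With balanced classes, as in this toy set, the prior does not change the decision. It matters when one class is much more frequent than the other in the training data.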