K-Means is a commonly used clustering algorithm. Compared with other clustering algorithms, it has low time complexity and produces reasonably good clusters. This post gives a brief introduction to the k-means algorithm; the figure below shows the result of clustering a handwritten-digit dataset.
Basic idea
The k-means algorithm requires the number of clusters k to be specified in advance. It begins by randomly picking k records as centroids, then scans every record in the dataset and assigns each record to the cluster whose centroid is nearest. Each centroid is then replaced by the mean of the records now in its cluster, and this assign-and-update loop is repeated until convergence.
Convergence here shows up in two ways: the cluster each record belongs to no longer changes, and the optimization objective barely changes. The time complexity of the algorithm is O(K*N*T), where K is the number of centroids, N the size of the dataset, and T the number of iterations.
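The assign-and-update loop above can be sketched in a few lines of NumPy. This is a minimal illustration with names of my own choosing, not the sklearn implementation used later in this post; the optional `init` parameter just makes the sketch easy to test with fixed starting centers.

```python
import numpy as np

def kmeans(X, k, n_iter=100, init=None, seed=0):
    """Minimal k-means on an (n, d) array X with k clusters.

    Follows the loop described above: assign each record to its nearest
    centroid, then replace each centroid by the mean of its cluster.
    """
    rng = np.random.default_rng(seed)
    if init is None:
        # Randomly pick k records as the initial centroids.
        centers = X[rng.choice(len(X), size=k, replace=False)]
    else:
        centers = np.asarray(init, dtype=float)
    for _ in range(n_iter):
        # Distance of every record to every centroid -> shape (n, k).
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # New centroid = mean of its cluster (kept unchanged if a cluster is empty).
        new_centers = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                                else centers[j] for j in range(k)])
        if np.allclose(new_centers, centers):  # converged: centroids stable
            break
        centers = new_centers
    return labels, centers
```

On two well-separated groups of points this converges in a couple of iterations; the real work in practice is choosing the initial centroids and k, which the following sections discuss.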
Optimization objective
The loss function of k-means is the squared error:

$RSS_k = \sum_{x \in \omega_k} |x - u(\omega_k)|^2$

$RSS = \sum_{k=1}^{K} RSS_k$

where $\omega_k$ denotes the k-th cluster, $u(\omega_k)$ the centroid of the k-th cluster, $RSS_k$ the loss of the k-th cluster, and $RSS$ the overall loss. The optimization objective is to choose the assignment of records to clusters that minimizes this overall loss.
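The loss above translates directly into code; a small sketch (function name my own) that sums the squared error of each cluster around its centroid:

```python
import numpy as np

def rss(X, labels, centers):
    """Overall loss: RSS = sum over k of sum_{x in omega_k} |x - u(omega_k)|^2."""
    return sum(((X[labels == k] - c) ** 2).sum() for k, c in enumerate(centers))
```

This is the same quantity sklearn's KMeans exposes as `inertia_` after fitting.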
Choosing the initial centroids
The k-means algorithm is guaranteed to converge, but not to converge to the global optimum: with a poor choice of initial centroids it only reaches a local optimum, and the clustering quality suffers. The initial centroids can be chosen with the following methods:
1. Pick points that are as far from each other as possible as the initial centroids;
2. Run a hierarchical clustering first to obtain k clusters, and use their centroids as the initial centroids for k-means;
3. Train k-means several times from different random initial centroids and keep the run with the best clustering result.
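Strategy 1 can be sketched greedily: start from one random record, then repeatedly add the record farthest from all centers chosen so far. This is a simplified, deterministic cousin of the `k-means++` initialization that the sklearn code later in this post uses (k-means++ samples new centers with probability proportional to squared distance rather than always taking the maximum); the function name is my own.

```python
import numpy as np

def farthest_point_init(X, k, seed=0):
    """Greedy far-apart initialization (strategy 1 above)."""
    rng = np.random.default_rng(seed)
    # Start from one randomly chosen record.
    centers = [X[rng.integers(len(X))]]
    for _ in range(k - 1):
        # Distance of each record to its nearest already-chosen center.
        d = np.min([np.linalg.norm(X - c, axis=1) for c in centers], axis=0)
        # Add the record farthest from all current centers.
        centers.append(X[d.argmax()])
    return np.array(centers)
```

Strategy 3 needs no extra code with sklearn: passing `n_init=10` to `KMeans` runs ten random restarts and keeps the one with the lowest inertia.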
Choosing the value of k
The error function of k-means has a serious flaw: as the number of clusters grows, the error tends to 0. In the most extreme case, every record forms its own cluster and the error of the data is 0, but such a clustering is not what we want. A structural-risk term can be introduced to penalize model complexity:
$K = \arg\min_k \left[ RSS_{min}(k) + \lambda k \right]$
Here $\lambda$ is a parameter that balances the training error against the number of clusters, but the problem has now turned into how to choose $\lambda$. Research [reference 1] suggests that when the dataset follows a Gaussian distribution, $\lambda = 2m$, where m is the dimensionality of the vectors.
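The penalized selection rule is a one-liner once the best error for each candidate k has been measured; a sketch (function name my own), taking a mapping from k to its best achieved error:

```python
def choose_k(rss_min, lam):
    """Pick K = argmin_k [RSS_min(k) + lambda * k].

    rss_min maps each candidate k to its best achieved squared error;
    lam is the complexity-penalty parameter discussed above.
    """
    return min(rss_min, key=lambda k: rss_min[k] + lam * k)
```

With `lam = 0` this degenerates to always picking the largest k (error keeps shrinking), which is exactly the flaw the penalty term exists to fix.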
Another approach is to try k values in increasing order while plotting the corresponding error, and to find a good k by looking for the elbow of the curve; see the text-clustering example below for details.
k-means text clustering
I crawled part of the articles on 36KR, 1,456 in total, segmented them into words, and ran k-means clustering with sklearn.
TF-IDF was used to select the feature words. The figure below plots the error as the number of centroids goes from 3 to 80:
The curve shows a fairly clear elbow at k=10, so k=10 was chosen as the number of centroids. The record counts of the 10 clusters are:
ãã{0: 152, 1: 239, 2: 142, 3: 61, 4: 119, 5: 44, 6: 71, 7: 394, 8: 141, 9: 93}
Generating cluster labels
Once clustering is done we need labels to describe each cluster: every cluster effectively becomes a class, so feature words can be selected as labels using TF-IDF, mutual information, or the chi-square statistic. For chi-square and mutual-information feature extraction, see my earlier post on text feature selection. Below are the TF-IDF label results for the 10 clusters.
Cluster 0: merchants, products, logistics, mall, payment, shopping guide, website, shopping, platform, orders
Cluster 1: investment, financing, US dollars, company, capital, market, obtain, domestic, China, last year
Cluster 2: phone, smart, hardware, device, TV, sports, data, features, health, use
Cluster 3: data, platform, market, students, app, mobile, information, company, healthcare, education
Cluster 4: enterprise, recruiting, talent, platform, company, it, mobile, website, security, information
Cluster 5: social, friends, dating, pets, features, events, friends, based on, sharing, games
Cluster 6: bookkeeping, wealth management, loans, bank, finance, p2p, investment, internet, funds, company
Cluster 7: tasks, collaboration, enterprise, sales, communication, work, projects, management, tools, improvement
Cluster 8: travel, tourism, hotel, booking, information, city, investment, open, app, demand
Cluster 9: video, content, games, music, images, photos, ads, reading, sharing, features
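The chi-square labeling mentioned above can be sketched with sklearn's `chi2` scorer once cluster assignments are available. The tiny doc-term count matrix, labels, and vocabulary below are made up for illustration:

```python
import numpy as np
from sklearn.feature_selection import chi2

# Hypothetical doc-term count matrix (4 docs x 3 terms) and cluster labels.
X = np.array([[3, 0, 1],
              [2, 0, 0],
              [0, 4, 1],
              [0, 3, 2]])
labels = np.array([0, 0, 1, 1])
terms = ['shop', 'video', 'app']  # made-up vocabulary

scores, _ = chi2(X, labels)
# Terms with the highest chi-square score are the ones most strongly
# associated with particular clusters, so they make good cluster labels.
top = [terms[i] for i in np.argsort(scores)[::-1][:2]]
```

Here 'shop' and 'video' each appear in only one cluster, so they score high and are selected, while 'app' appears in both and scores low.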
Implementation code
# -*- coding: utf-8 -*-
from __future__ import print_function

from sklearn.feature_extraction.text import TfidfVectorizer
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans, MiniBatchKMeans


def loadDataset():
    '''Load the text dataset: each '< title >' marker starts a new document.'''
    f = open('36krout.txt', 'r')
    dataset = []
    lastPage = None
    for line in f.readlines():
        if '< title >' in line and '< / title >' in line:
            if lastPage:
                dataset.append(lastPage)
            lastPage = line
        else:
            lastPage += line
    if lastPage:
        dataset.append(lastPage)
    f.close()
    return dataset


def transform(dataset, n_features=1000):
    '''Vectorize the documents with TF-IDF.'''
    vectorizer = TfidfVectorizer(max_df=0.5, max_features=n_features,
                                 min_df=2, use_idf=True)
    X = vectorizer.fit_transform(dataset)
    return X, vectorizer


def train(X, vectorizer, true_k=10, minibatch=False, showLabel=False):
    # Train k-means on mini-batches or on the full data.
    if minibatch:
        km = MiniBatchKMeans(n_clusters=true_k, init='k-means++', n_init=1,
                             init_size=1000, batch_size=1000, verbose=False)
    else:
        km = KMeans(n_clusters=true_k, init='k-means++', max_iter=300,
                    n_init=1, verbose=False)
    km.fit(X)
    if showLabel:
        print("Top terms per cluster:")
        order_centroids = km.cluster_centers_.argsort()[:, ::-1]
        # Note: newer sklearn versions rename this to get_feature_names_out().
        terms = vectorizer.get_feature_names()
        print(vectorizer.get_stop_words())
        for i in range(true_k):
            print("Cluster %d:" % i, end='')
            for ind in order_centroids[i, :10]:
                print(' %s' % terms[ind], end='')
            print()
    result = list(km.predict(X))
    print('Cluster distribution:')
    print(dict([(i, result.count(i)) for i in result]))
    return -km.score(X)


def test():
    '''Try a range of k values to find the best one.'''
    dataset = loadDataset()
    print("%d documents" % len(dataset))
    X, vectorizer = transform(dataset, n_features=500)
    true_ks = []
    scores = []
    for i in range(3, 80, 1):
        score = train(X, vectorizer, true_k=i) / len(dataset)
        print(i, score)
        true_ks.append(i)
        scores.append(score)
    plt.figure(figsize=(8, 4))
    plt.plot(true_ks, scores, label="error", color="red", linewidth=1)
    plt.xlabel("n_clusters")
    plt.ylabel("error")
    plt.legend()
    plt.show()


def out():
    '''Output the clustering result at the chosen k.'''
    dataset = loadDataset()
    X, vectorizer = transform(dataset, n_features=500)
    score = train(X, vectorizer, true_k=10, showLabel=True) / len(dataset)
    print(score)


# test()
out()