Wikipedia principal eigenvector
A classical way to assert the relative importance of vertices in a graph is to compute the principal eigenvector of the adjacency matrix, so that the value of each eigenvector component can be assigned to each node as a centrality score.
On the graph of web pages and links, these values are known as the PageRank scores used by Google. The goal of this example is to analyze the graph of links between Wikipedia articles and rank the articles by relative importance according to this eigenvector centrality.
The traditional way to compute the principal eigenvector is the power iteration method.
Here the computation is made possible by Martinsson's randomized SVD algorithm implemented in scikit-learn. The data for this example is taken from the DBpedia dumps; DBpedia is a project that extracts structured content from Wikipedia and makes the resulting structured data freely available on the web.
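To illustrate the power iteration idea before diving into the example, here is a minimal, self-contained sketch on a tiny hand-made adjacency matrix (not part of the original example; the matrix and the `power_iteration` helper are made up for illustration): repeatedly multiplying a vector by the matrix and renormalizing converges to the principal eigenvector.

```python
import numpy as np

def power_iteration(A, n_iter=100, tol=1e-10):
    """Repeatedly apply A and renormalize: converges to the principal eigenvector."""
    v = np.full(A.shape[0], 1.0 / A.shape[0])
    for _ in range(n_iter):
        v_new = A @ v
        v_new /= np.linalg.norm(v_new)
        if np.abs(v_new - v).max() < tol:
            break
        v = v_new
    return v_new

# Toy directed graph: 0 -> 1, 0 -> 2, 1 -> 2, 2 -> 0 (A[i, j] = 1 means i links to j).
A = np.array([[0., 1., 1.],
              [0., 0., 1.],
              [1., 0., 0.]])
# Use A.T so that a node's score is driven by its incoming links.
print(power_iteration(A.T))
```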

(1) Importing the libraries

The following libraries are imported (the exact import statements are shown right after this list):

1. `bz2` (`BZ2File`): read the bzip2-compressed dump files
2. `os`: check whether the data files already exist on disk
3. `datetime`: timestamp the progress messages
4. `time`: measure the elapsed computation time
5. `numpy` (as `np`): array computations
6. `scipy.sparse`: build the sparse adjacency matrix
7. `joblib` (`Memory`): cache intermediate results
8. `sklearn.decomposition` (`randomized_svd`): compute the randomized SVD
9. `urllib.request` (`urlopen`): download the data files
10. `pprint`: pretty-print the resulting rankings
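The corresponding import block, taken from the full script in section (7), reads:

```python
from bz2 import BZ2File
import os
from datetime import datetime
from pprint import pprint
from time import time

import numpy as np
from scipy import sparse
from joblib import Memory
from sklearn.decomposition import randomized_svd
from urllib.request import urlopen
```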

(2) Loading the compressed files

The two data files used are `redirects_en.nt.bz2` and `page_links_en.nt.bz2`; they are downloaded if not already present on disk:
```python
redirects_url = "http://downloads.dbpedia.org/3.5.1/en/redirects_en.nt.bz2"
redirects_filename = redirects_url.rsplit("/", 1)[1]

page_links_url = "http://downloads.dbpedia.org/3.5.1/en/page_links_en.nt.bz2"
page_links_filename = page_links_url.rsplit("/", 1)[1]

resources = [
    (redirects_url, redirects_filename),
    (page_links_url, page_links_filename),
]

for url, filename in resources:
    if not os.path.exists(filename):
        print("Downloading data from '%s', please wait..." % url)
        opener = urlopen(url)
        open(filename, 'wb').write(opener.read())
        print()
```
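Each downloaded `.nt.bz2` file is a bzip2-compressed N-Triples dump: every line encodes one `<subject> <predicate> <object> .` triple, which is why the parsing code in the next section keeps only lines that split into exactly four tokens. A small illustration with a made-up line (the predicate URI below is purely illustrative):

```python
# Hypothetical N-Triples line of the kind found in the DBpedia dumps
# (BZ2File yields bytes in Python 3, hence the b"" literals).
line = (b"<http://dbpedia.org/resource/Foo> "
        b"<http://example.org/redirectsTo> "
        b"<http://dbpedia.org/resource/Bar> .")

split = line.split()
print(len(split))          # 4: subject, predicate, object, trailing "."
print(split[0], split[2])  # the subject and object URIs used by the parser
```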

(3) Setting up the helper functions

A transitive closure records, for each pair of nodes, whether one can be reached from the other: if it can, the two nodes are connected by an edge. Here the redirect map is transitively closed so that every article name maps directly to its final redirect target (see the small illustration after the code block below).
```python
def get_redirects(redirects_filename):
    """Parse the redirects and build a transitively closed map out of them"""
    redirects = {}
    print("Parsing the NT redirect file")
    for l, line in enumerate(BZ2File(redirects_filename)):
        split = line.split()
        if len(split) != 4:
            print("ignoring malformed line: " + line)
            continue
        redirects[short_name(split[0])] = short_name(split[2])
        if l % 1000000 == 0:
            print("[%s] line: %08d" % (datetime.now().isoformat(), l))

    # compute the transitive closure
    print("Computing the transitive closure of the redirect relation")
    for l, source in enumerate(redirects.keys()):
        transitive_target = None
        target = redirects[source]
        seen = {source}
        while True:
            transitive_target = target
            target = redirects.get(target)
            if target is None or target in seen:
                break
            seen.add(target)
        redirects[source] = transitive_target
        if l % 1000000 == 0:
            print("[%s] line: %08d" % (datetime.now().isoformat(), l))

    return redirects
```
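To see what the transitive closure does, here is a tiny hand-made illustration (the dictionary is hypothetical and the loop only mimics the logic of `get_redirects` above): a redirect chain A → B → C is collapsed so that both A and B point directly to C.

```python
# Hypothetical redirect map before closure: "A" redirects to "B", "B" to "C".
redirects = {"A": "B", "B": "C"}

# Resolve each source to its final target, mimicking get_redirects.
for source in list(redirects):
    target = redirects[source]
    seen = {source}
    while target in redirects and redirects[target] not in seen:
        seen.add(target)
        target = redirects[target]
    redirects[source] = target

print(redirects)  # {'A': 'C', 'B': 'C'}
```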

(4) The adjacency matrix

First, a brief word on sparse matrices: a matrix in which the zero entries vastly outnumber the non-zero entries, and in which the non-zero entries follow no particular pattern, is called a sparse matrix. The function in this section extracts the link graph as such a sparse matrix and uses it as the adjacency matrix.
The matrix is created with scipy's `sparse.lil_matrix`:
```python
X = sparse.lil_matrix((len(index_map), len(index_map)), dtype=np.float32)
```
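The LIL (list of lists) format makes incremental, element-wise construction cheap, while the CSR format the function converts to afterwards is much faster for the matrix-vector products used by the SVD and the power iteration. A minimal sketch of this build-in-LIL-then-convert pattern, on a made-up 3x3 matrix unrelated to the Wikipedia data:

```python
import numpy as np
from scipy import sparse

# Build incrementally in LIL format: cheap element-wise assignment.
A = sparse.lil_matrix((3, 3), dtype=np.float32)
A[0, 1] = 1.0
A[2, 0] = 1.0

# Convert once to CSR for fast arithmetic (e.g. matrix-vector products).
A = A.tocsr()
print(A @ np.ones(3))  # [1. 0. 1.]
```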
The function that builds the adjacency matrix is defined as follows:
```python
def get_adjacency_matrix(redirects_filename, page_links_filename, limit=None):
    """Extract the adjacency graph as a scipy sparse matrix

    Redirects are resolved first.

    Returns X, the scipy sparse adjacency matrix, redirects as python
    dict from article names to article names and index_map a python dict
    from article names to python int (article indexes).
    """

    print("Computing the redirect map")
    redirects = get_redirects(redirects_filename)

    print("Computing the integer index map")
    index_map = dict()
    links = list()
    for l, line in enumerate(BZ2File(page_links_filename)):
        split = line.split()
        if len(split) != 4:
            print("ignoring malformed line: " + line)
            continue
        i = index(redirects, index_map, short_name(split[0]))
        j = index(redirects, index_map, short_name(split[2]))
        links.append((i, j))
        if l % 1000000 == 0:
            print("[%s] line: %08d" % (datetime.now().isoformat(), l))

        if limit is not None and l >= limit - 1:
            break

    print("Computing the adjacency matrix")
    X = sparse.lil_matrix((len(index_map), len(index_map)), dtype=np.float32)
    for i, j in links:
        X[i, j] = 1.0
    del links
    print("Converting to CSR representation")
    X = X.tocsr()
    print("CSR conversion done")
    return X, redirects, index_map
```
It returns three values:

* `X`: the computed scipy sparse adjacency matrix.
* `redirects`: a Python dict mapping article names to the article names they redirect to.
* `index_map`: a Python dict mapping article names to integer article indexes.

To keep the working set in RAM, processing stops after 5M links:
```python
X, redirects, index_map = get_adjacency_matrix(
    redirects_filename, page_links_filename, limit=5000000)
names = {i: name for name, i in index_map.items()}
```

(5) Computing the singular vectors

Use `randomized_svd` to compute the SVD (singular value decomposition):
```python
print("Computing the principal singular vectors using randomized_svd")
t0 = time()
U, s, V = randomized_svd(X, 5, n_iter=3)
print("done in %0.3fs" % (time() - t0))

# print the names of the wikipedia related strongest components of the
# principal singular vector which should be similar to the highest eigenvector
print("Top wikipedia pages according to principal singular vectors")
pprint([names[i] for i in np.abs(U.T[0]).argsort()[-10:]])
pprint([names[i] for i in np.abs(V[0]).argsort()[-10:]])
```
Output:
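As a side note, the sketch below (not part of the original example) shows the shapes returned by `randomized_svd` on a small random sparse matrix; here the function is imported from `sklearn.utils.extmath`, where it is defined:

```python
import numpy as np
from scipy import sparse
from sklearn.utils.extmath import randomized_svd

rng = np.random.RandomState(0)
M = sparse.random(100, 80, density=0.05, format="csr", random_state=rng)

# Keep the 5 leading singular triplets, as in the example above.
U, s, V = randomized_svd(M, n_components=5, n_iter=3, random_state=rng)
print(U.shape, s.shape, V.shape)  # (100, 5) (5,) (5, 80)
# U[:, 0] and V[0] are the principal left/right singular vectors of M.
```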

(6) Computing the principal eigenvector scores

Define a centrality-scores function that computes the principal eigenvector with the power iteration method. This method is also known as Google PageRank; the implementation here is based on the one from the NetworkX project (also BSD licensed), with copyrights by Aric Hagberg, Dan Schult, and Pieter Swart.
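Written out, each update performed inside the loop below amounts to the following (this is just the code restated as a formula, not an addition to the algorithm):

$$
s^{(t+1)} = \alpha\left(s^{(t)} P + \langle d, s^{(t)} \rangle\,\mathbf{1}\right) + \frac{1-\alpha}{n}\Big(\sum_i s^{(t)}_i\Big)\mathbf{1}
$$

where $P$ is the row-normalized adjacency matrix, $d$ is the dangling-node vector (entries $1/n$ on rows of $X$ that sum to zero, $0$ elsewhere), $\mathbf{1}$ is the all-ones vector, and $\alpha$ is the damping factor (`alpha=0.85` by default).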
```python
def centrality_scores(X, alpha=0.85, max_iter=100, tol=1e-10):

    n = X.shape[0]
    X = X.copy()
    incoming_counts = np.asarray(X.sum(axis=1)).ravel()

    print("Normalizing the graph")
    for i in incoming_counts.nonzero()[0]:
        X.data[X.indptr[i]:X.indptr[i + 1]] *= 1.0 / incoming_counts[i]
    dangle = np.asarray(np.where(np.isclose(X.sum(axis=1), 0),
                                 1.0 / n, 0)).ravel()

    scores = np.full(n, 1. / n, dtype=np.float32)  # initial guess
    for i in range(max_iter):
        print("power iteration #%d" % i)
        prev_scores = scores
        scores = (alpha * (scores * X + np.dot(dangle, prev_scores))
                  + (1 - alpha) * prev_scores.sum() / n)
        # check convergence: normalized l_inf norm
        scores_max = np.abs(scores).max()
        if scores_max == 0.0:
            scores_max = 1.0
        err = np.abs(scores - prev_scores).max() / scores_max
        print("error: %0.6f" % err)
        if err < n * tol:
            return scores

    return scores


print("Computing principal eigenvector score using a power iteration method")
t0 = time()
scores = centrality_scores(X, max_iter=100)
print("done in %0.3fs" % (time() - t0))
pprint([names[i] for i in np.abs(scores).argsort()[-10:]])
```
Output:
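As a small usage illustration (assuming the `centrality_scores` function defined above is available, and using a made-up 3-node graph instead of the Wikipedia data): nodes 0 and 2, which receive incoming links, end up with clearly higher scores than node 1, which receives none.

```python
import numpy as np
from scipy import sparse

# Toy directed graph: 0 -> 2, 1 -> 2, 2 -> 0 (X[i, j] = 1 means i links to j).
rows = np.array([0, 1, 2])
cols = np.array([2, 2, 0])
data = np.ones(3, dtype=np.float32)
A = sparse.csr_matrix((data, (rows, cols)), shape=(3, 3))

toy_scores = centrality_scores(A, max_iter=100)
print(toy_scores)  # nodes 0 and 2 clearly outrank node 1
```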

(7) Complete code

Python source code: `wikipedia_principal_eigenvector.py`
```python
# Author: Olivier Grisel <[email protected]>
# License: BSD 3 clause

from bz2 import BZ2File
import os
from datetime import datetime
from pprint import pprint
from time import time

import numpy as np

from scipy import sparse

from joblib import Memory

from sklearn.decomposition import randomized_svd
from urllib.request import urlopen


print(__doc__)

# #############################################################################
# Where to download the data, if not already on disk
redirects_url = "http://downloads.dbpedia.org/3.5.1/en/redirects_en.nt.bz2"
redirects_filename = redirects_url.rsplit("/", 1)[1]

page_links_url = "http://downloads.dbpedia.org/3.5.1/en/page_links_en.nt.bz2"
page_links_filename = page_links_url.rsplit("/", 1)[1]

resources = [
    (redirects_url, redirects_filename),
    (page_links_url, page_links_filename),
]

for url, filename in resources:
    if not os.path.exists(filename):
        print("Downloading data from '%s', please wait..." % url)
        opener = urlopen(url)
        open(filename, 'wb').write(opener.read())
        print()


# #############################################################################
# Loading the redirect files

memory = Memory(cachedir=".")


def index(redirects, index_map, k):
    """Find the index of an article name after redirect resolution"""
    k = redirects.get(k, k)
    return index_map.setdefault(k, len(index_map))


DBPEDIA_RESOURCE_PREFIX_LEN = len("http://dbpedia.org/resource/")
SHORTNAME_SLICE = slice(DBPEDIA_RESOURCE_PREFIX_LEN + 1, -1)


def short_name(nt_uri):
    """Remove the < and > URI markers and the common URI prefix"""
    return nt_uri[SHORTNAME_SLICE]


def get_redirects(redirects_filename):
    """Parse the redirects and build a transitively closed map out of them"""
    redirects = {}
    print("Parsing the NT redirect file")
    for l, line in enumerate(BZ2File(redirects_filename)):
        split = line.split()
        if len(split) != 4:
            print("ignoring malformed line: " + line)
            continue
        redirects[short_name(split[0])] = short_name(split[2])
        if l % 1000000 == 0:
            print("[%s] line: %08d" % (datetime.now().isoformat(), l))

    # compute the transitive closure
    print("Computing the transitive closure of the redirect relation")
    for l, source in enumerate(redirects.keys()):
        transitive_target = None
        target = redirects[source]
        seen = {source}
        while True:
            transitive_target = target
            target = redirects.get(target)
            if target is None or target in seen:
                break
            seen.add(target)
        redirects[source] = transitive_target
        if l % 1000000 == 0:
            print("[%s] line: %08d" % (datetime.now().isoformat(), l))

    return redirects


# disabling joblib as the pickling of large dicts seems much too slow
# @memory.cache
def get_adjacency_matrix(redirects_filename, page_links_filename, limit=None):
    """Extract the adjacency graph as a scipy sparse matrix

    Redirects are resolved first.

    Returns X, the scipy sparse adjacency matrix, redirects as python
    dict from article names to article names and index_map a python dict
    from article names to python int (article indexes).
    """

    print("Computing the redirect map")
    redirects = get_redirects(redirects_filename)

    print("Computing the integer index map")
    index_map = dict()
    links = list()
    for l, line in enumerate(BZ2File(page_links_filename)):
        split = line.split()
        if len(split) != 4:
            print("ignoring malformed line: " + line)
            continue
        i = index(redirects, index_map, short_name(split[0]))
        j = index(redirects, index_map, short_name(split[2]))
        links.append((i, j))
        if l % 1000000 == 0:
            print("[%s] line: %08d" % (datetime.now().isoformat(), l))

        if limit is not None and l >= limit - 1:
            break

    print("Computing the adjacency matrix")
    X = sparse.lil_matrix((len(index_map), len(index_map)), dtype=np.float32)
    for i, j in links:
        X[i, j] = 1.0
    del links
    print("Converting to CSR representation")
    X = X.tocsr()
    print("CSR conversion done")
    return X, redirects, index_map


# stop after 5M links to make it possible to work in RAM
X, redirects, index_map = get_adjacency_matrix(
    redirects_filename, page_links_filename, limit=5000000)
names = {i: name for name, i in index_map.items()}

print("Computing the principal singular vectors using randomized_svd")
t0 = time()
U, s, V = randomized_svd(X, 5, n_iter=3)
print("done in %0.3fs" % (time() - t0))

# print the names of the wikipedia related strongest components of the
# principal singular vector which should be similar to the highest eigenvector
print("Top wikipedia pages according to principal singular vectors")
pprint([names[i] for i in np.abs(U.T[0]).argsort()[-10:]])
pprint([names[i] for i in np.abs(V[0]).argsort()[-10:]])


def centrality_scores(X, alpha=0.85, max_iter=100, tol=1e-10):
    """Power iteration computation of the principal eigenvector

    This method is also known as Google PageRank and the implementation
    is based on the one from the NetworkX project (BSD licensed too)
    with copyrights by:

      Aric Hagberg <[email protected]>
      Dan Schult <[email protected]>
      Pieter Swart <[email protected]>
    """
    n = X.shape[0]
    X = X.copy()
    incoming_counts = np.asarray(X.sum(axis=1)).ravel()

    print("Normalizing the graph")
    for i in incoming_counts.nonzero()[0]:
        X.data[X.indptr[i]:X.indptr[i + 1]] *= 1.0 / incoming_counts[i]
    dangle = np.asarray(np.where(np.isclose(X.sum(axis=1), 0),
                                 1.0 / n, 0)).ravel()

    scores = np.full(n, 1. / n, dtype=np.float32)  # initial guess
    for i in range(max_iter):
        print("power iteration #%d" % i)
        prev_scores = scores
        scores = (alpha * (scores * X + np.dot(dangle, prev_scores))
                  + (1 - alpha) * prev_scores.sum() / n)
        # check convergence: normalized l_inf norm
        scores_max = np.abs(scores).max()
        if scores_max == 0.0:
            scores_max = 1.0
        err = np.abs(scores - prev_scores).max() / scores_max
        print("error: %0.6f" % err)
        if err < n * tol:
            return scores

    return scores


print("Computing principal eigenvector score using a power iteration method")
t0 = time()
scores = centrality_scores(X, max_iter=100)
print("done in %0.3fs" % (time() - t0))
pprint([names[i] for i in np.abs(scores).argsort()[-10:]])
```