Frank Zou's Blog: October 2017

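The IEEE Xplore database offers a download service for result lists, so the details of the retrieved papers can easily be exported to a csv file. The exported information covers 31 tags in total, including the document title, authors, publication year, and publication title. The previous post (Batch Downloading Papers from the IEEE Xplore Database) showed how to download the papers listed in that csv file. This post adds to it by customizing the downloaded list, for example keeping only a few tags of interest such as the document title, authors, publication year, publication title, and citation count. In addition, a hyperlink is added to each entry that points to the locally downloaded paper, which makes indexing and viewing easier.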

Both goals can be achieved quickly in Matlab. The main functions involved are:

xlsread / xlswrite
actxserver

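Here, xlsread / xlswrite read from and write to Excel (csv) files, respectively, while actxserver creates a Windows COM object that can then be manipulated. For example, exl = actxserver('excel.application') creates an Excel object. The following example shows how to add a hyperlink to an Excel cell from Matlab: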

exl = actxserver('excel.application');
exlWkbk = exl.Workbooks;
exlFile = exlWkbk.Open([pwd '/' filename]);
exlSheet1 = exlFile.Sheets.Item('Sheet1');

rngObj = exlSheet1.get('Cells', row, col);
exlSheet1.Hyperlinks.Add(rngObj, 'somelink');

exlFile.Save()
exlFile.Close()
exl.Quit
exl.delete

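Once the Excel object has been created, the methods of Excel VBA can be called to access the cells of a worksheet. A hyperlink is added with exlSheet1.Hyperlinks.Add(rngObj, 'somelink'); note that a cell is addressed by its row and column index (R1C1 style) with rngObj = exlSheet1.get('Cells', row, col); The complete code is as follows: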

function SelectInterestTags(export, filename, savepath)
%% Select interested tags from IEEE Xplore export file
%   and generate an excel file which contains hyperlink for each entry to
%   locate the downloaded file
%
%   export: csv file downloaded from IEEE Xplore
% filename: saved excel filename
% savepath: path for downloaded pdf files
%
clc

%% initial
switch(nargin)
    case 2
        savepath = pwd;
    case 3
        % do nothing
    otherwise
        error('Wrong number of inputs.')
end

%% load csv file
[raw_numerical, raw_text, RAW] = xlsread(export);
NameList = raw_text(3:end, 1);
YearList = raw_numerical(1:end, 2);

pat = '[\\/:*?"<>|]';
NameList = regexprep(NameList, pat, ' ');

%% disp all tags and choose interest tags
Tags = raw_text(2, :);
for k = 1 : length(Tags)
    fprintf('\t%d\t%s\n', k, Tags{k});
end

prompt = 'Please Select Your Interest Tags (-1 for all, 0 for default): ';
interestTags = input(prompt);

%% parameter check
if isequal(interestTags, -1)
    interestTags = 1:length(Tags);
elseif isequal(interestTags, 0) % default selection
    interestTags = [1, 22, 4, 6, 11, 17, 24, 2];
end

disp('The following tags are selected: ')
disp(Tags(interestTags)')

%% write xls file
InterestArray = RAW(2:end, interestTags);
xlswrite(filename, InterestArray);

%% add hyperlink for each paper

exl = actxserver('excel.application');
exlWkbk = exl.Workbooks;
exlFile = exlWkbk.Open([pwd '/' filename]);
exlSheet1 = exlFile.Sheets.Item('Sheet1');

for k = 1 : size(InterestArray, 1) - 1 % one row per paper, excluding the header row
    pdfFile = [savepath '/' num2str(YearList(k)) ' ' NameList{k} '.pdf'];
    if exist(pdfFile, 'file') == 2
        rngObj = exlSheet1.get('Cells', k + 1, 1);
        exlSheet1.Hyperlinks.Add(rngObj, pdfFile);
    end
end
disp([filename ' generated!'])

%% save file and close activex excel com
exlFile.Save()
exlFile.Close()
exl.Quit
exl.delete

References

[1]. Using ActiveX to Write Data to an Excel Spreadsheet

[2]. Reference Excel cells in R1C1 style using ConvertFormula

[3]. Add a hyperlink in excell through matlab

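Notes

The discussion above assumes that you have access to IEEE Xplore, which most campus networks provide;
When saving a paper under its title, watch out for special characters: the characters \/:*?"<>| cannot appear in a file name, so they need to be replaced, for example with spaces. This can be done with a regular expression, and regexprep is available in Matlab;
When downloading, it is best to wait between consecutive downloads to mimic manual operation and avoid having your IP blocked. For example, pause(30 * rand() + 30) gives a random wait time.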
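The last two notes can be tried out in isolation. The following is only a minimal sketch: the titles and years are made-up sample values (not taken from a real export file), and the variable names safeTitles and pdfName are chosen here for illustration. It replaces the characters that are not allowed in file names with spaces, builds file names in the "year title.pdf" form used in this post, and computes a random wait time without actually pausing.

```matlab
% Hypothetical sample data, standing in for the title and year columns
% of an IEEE Xplore export file.
titles = {'Deep Learning: A Review', 'What is 5G/6G?'};
years  = [2015, 2017];

% Characters that cannot appear in a file name: \ / : * ? " < > |
pat = '[\\/:*?"<>|]';

% regexprep accepts a cell array and processes every title in it.
safeTitles = regexprep(titles, pat, ' ');

% Build the file names in the form "<year> <title>.pdf".
for k = 1 : numel(safeTitles)
    pdfName = sprintf('%d %s.pdf', years(k), safeTitles{k});
    disp(pdfName)
end

% Random wait time between 30 and 60 seconds; pass it to pause()
% between two downloads.
waitTime = 30 * rand() + 30;
```

Each illegal character is replaced by exactly one space, so a title such as "Deep Learning: A Review" ends up with two consecutive spaces. That is harmless as long as the same replacement is used both when saving the file and when looking it up later.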

function DownloadPDFfromXplore(export, skip)
%% download pdf from IEEE Xplore export file
%   if it terminates at any exception, we can restart and skip those downloaded

if nargin == 1
    skip = 0;
end

[raw_numerical, raw_text, ~] = xlsread(export);
UrlList = raw_text(skip+3:end, 16);
NameList = raw_text(skip+3:end, 1);
YearList = raw_numerical(skip+1:end, 2);

pat = '[\\/:*?"<>|]';
NameList = regexprep(NameList, pat, ' ');

for k = 1 : length(NameList)
    html = webread(UrlList{k});
    first = strfind(html, '<iframe src="h');
    last = strfind(html, '" frameborder=0>');
    url = html(first+13:last-1);
    disp(url)

    filename = [num2str(YearList(k)) ' ' NameList{k} '.pdf'];
    disp(filename)
    websave(filename, url);

    waitTime = 30 * rand() + 30;
    pause(waitTime);
end
