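This is the story of scraping the web manual for Ops I, an integrated operations service from Hitachi, Ltd., with GPT Crawler, feeding the result to GPT-4, and building an AI assistant for Ops I users.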

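What is GPT Crawler

GPT Crawler is a web scraping tool written in TypeScript and developed by Builder.io. It was announced on November 14, 2023.

Give it a URL and run it, and it crawls from that page, scrapes each page it reaches, extracts the text, and dumps it to JSON, so you can easily build a custom GPT from the result. It appears to use headless Chrome and Playwright for the scraping.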

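What is a custom GPT

A custom GPT is a chatbot built with the GPTs feature of ChatGPT. GPTs is a new feature announced at OpenAI DevDay on 2023/11/7 that lets you give ChatGPT additional knowledge and customize it for a specific purpose.

Using GPTs requires upgrading to ChatGPT Plus at $20 per month, which is a bit pricey for a quick trial. So this time I use OpenAI's Assistants instead.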

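What is OpenAI Assistants

OpenAI's Assistants is a feature for building purpose-specific chatbots (assistants) by giving GPT custom instructions and connecting it to various tools. One of the supported tools is Knowledge Retrieval, which can handle the output of GPT Crawler.

Knowledge Retrieval puts text files uploaded to OpenAI into a vector database and searches them when the assistant answers a question. It is billed by the amount of uploaded data, at an affordable $0.20/GB per assistant per day. On top of that, it is currently free until 2024/1/12. (I think it was free only until 2023/12 a little while ago, so it may have been extended.)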
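
To get a feel for that storage price, here is a minimal arithmetic sketch. The function name `retrievalStorageCostUsd` is made up for illustration, and treating 1 GB as 1024^3 bytes is an assumption; the actual billing definition may differ. The 580 KB figure is roughly the size of the crawl output obtained later in this post.

```typescript
// Knowledge Retrieval storage price quoted above: $0.20 per GB, per assistant, per day.
const PRICE_USD_PER_GB_PER_DAY = 0.2;

// Assumption: 1 GB is treated as 1024^3 bytes; the billing definition may differ.
const BYTES_PER_GB = 1024 ** 3;

// Hypothetical helper: storage cost for a file attached to some assistants for some days.
function retrievalStorageCostUsd(
  sizeBytes: number,
  assistants: number,
  days: number,
): number {
  const sizeGb = sizeBytes / BYTES_PER_GB;
  return sizeGb * PRICE_USD_PER_GB_PER_DAY * assistants * days;
}

// Example: a 580 KB file attached to one assistant for 30 days.
const exampleCost = retrievalStorageCostUsd(580 * 1024, 1, 30);
console.log(`about $${exampleCost.toFixed(4)} for 30 days`);
```

At this scale the storage cost is a small fraction of a cent per month, so for a single manual the query cost is likely to matter more.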

Queries to the assistant are also cheap, at a few yen per 1k tokens, so it is a good fit for a trial.

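What is Ops I

Ops I is short for JP1 Cloud Service/Operations Integration, an integrated operations service from Hitachi, Ltd.

Its distinguishing feature is Operations as Code: operations automation code and workflows are managed centrally in Git, which helps drive the automation and integration of a wide range of system operations across hybrid environments.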



The Ops I manual is built with Hugo, a static site generator, and published as a website, so it is easy to scrape.

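Scraping the Ops I manual

Scrape the Ops I manual with GPT Crawler.

GPT Crawler is a simple tool: you write a config file, run npm start, and it scrapes the site and writes the result to a file. All it needs is Node, but since the repository includes a Dockerfile, I run it in a container.

First, build the container image.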

$ git clone https://github.com/BuilderIO/gpt-crawler.git
$ cd gpt-crawler/containerapp/
$ git checkout v1.2.1
$ docker build -t kaitoy/gpt-crawler:1.2.1 .

I pushed the built image to Docker Hub.


Start a shell in the built image and write the GPT Crawler config file to /home/gpt-crawler/config.ts inside the container.

$ docker run -it --entrypoint bash kaitoy/gpt-crawler:1.2.1
root@d8fee4ed8924:/home# cd gpt-crawler
root@d8fee4ed8924:/home/gpt-crawler# cat<<EOF>config.ts
import { Config } from "./src/config";

export const defaultConfig: Config = {
  url: "https://itpfdoc.hitachi.co.jp/manuals/JCS/JCSM71020001/index.html",
  match: "https://itpfdoc.hitachi.co.jp/manuals/JCS/JCSM71020001/**",
  selector: "#body-inner",
  maxPagesToCrawl: 1000,
  outputFileName: "output.json",
};
EOF

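In the Config object written to config.ts above, url is the URL where crawling starts, match is a glob pattern for the URLs to scrape, and selector is the selector of the HTML element to scrape; the remaining parameters can be set or left out as needed. match also accepts an array of multiple patterns.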
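
To illustrate what a glob in match selects, here is a simplified sketch of glob matching against URLs. It is not how GPT Crawler matches internally; `globToRegExp` and `matchesAny` are hypothetical helpers, the second pattern is a made-up example, and the sketch assumes that `**` matches any characters including `/` while `*` stays within one path segment.

```typescript
// Simplified glob-to-RegExp conversion, for illustration only.
// Assumption: "**" matches anything (including "/"), "*" matches within one path segment.
// Other glob syntax such as "?" or "{a,b}" is treated as literal text here.
function globToRegExp(glob: string): RegExp {
  // Escape regex metacharacters except "*".
  const escaped = glob.replace(/[.+^${}()|[\]\\?]/g, "\\$&");
  // Translate "**" and "*" in a single pass so they do not interfere.
  const pattern = escaped.replace(/\*\*|\*/g, (m) => (m === "**" ? ".*" : "[^/]*"));
  return new RegExp(`^${pattern}$`);
}

// Accepts a single pattern or an array of patterns, like the match field.
function matchesAny(url: string, match: string | string[]): boolean {
  const patterns = Array.isArray(match) ? match : [match];
  return patterns.some((p) => globToRegExp(p).test(url));
}

const match = [
  "https://itpfdoc.hitachi.co.jp/manuals/JCS/JCSM71020001/**",
  "https://example.com/docs/**", // hypothetical second pattern
];

const inside = "https://itpfdoc.hitachi.co.jp/manuals/JCS/JCSM71020001/outline/index.html";
const outside = "https://itpfdoc.hitachi.co.jp/manuals/JP1/other/index.html";

console.log(matchesAny(inside, match)); // true
console.log(matchesAny(outside, match)); // false
```

Only links whose URLs fall under one of the patterns are followed, which is what keeps the crawl inside the manual.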


Once the config is written, run it with npm start.

root@d8fee4ed8924:/home/gpt-crawler# npm start

> @builder.io/[email protected] start
> npm run start:dev


> @builder.io/[email protected] start:dev
> cross-env NODE_ENV=development npm run build && node dist/src/main.js


> @builder.io/[email protected] build
> tsc

INFO  PlaywrightCrawler: Starting the crawler.
INFO  PlaywrightCrawler: Crawling: Page 1 / 1000 - URL: https://itpfdoc.hitachi.co.jp/manuals/JCS/JCSM71020001/index.html...
INFO  PlaywrightCrawler: Crawling: Page 2 / 1000 - URL: https://itpfdoc.hitachi.co.jp/manuals/JCS/JCSM71020001/introduction/index.html...
INFO  PlaywrightCrawler: Crawling: Page 3 / 1000 - URL: https://itpfdoc.hitachi.co.jp/manuals/JCS/JCSM71020001/history/index.html...
INFO  PlaywrightCrawler: Crawling: Page 4 / 1000 - URL: https://itpfdoc.hitachi.co.jp/manuals/JCS/JCSM71020001/outline/index.html...
INFO  PlaywrightCrawler: Crawling: Page 5 / 1000 - URL: https://itpfdoc.hitachi.co.jp/manuals/JCS/JCSM71020001/table-of-contents/index.html...
INFO  PlaywrightCrawler: Crawling: Page 6 / 1000 - URL: https://itpfdoc.hitachi.co.jp/manuals/JCS/JCSM71020001/outline/opsi_outline/index.html...
INFO  PlaywrightCrawler: Crawling: Page 7 / 1000 - URL: https://itpfdoc.hitachi.co.jp/manuals/JCS/JCSM71020001/outline/opsi_forte/index.html...
INFO  PlaywrightCrawler: Crawling: Page 8 / 1000 - URL: https://itpfdoc.hitachi.co.jp/manuals/JCS/JCSM71020001/outline/method/index.html...
INFO  PlaywrightCrawler: Crawling: Page 9 / 1000 - URL: https://itpfdoc.hitachi.co.jp/manuals/JCS/JCSM71020001/outline/note_restriction/index.html...
INFO  PlaywrightCrawler: Crawling: Page 10 / 1000 - URL: https://itpfdoc.hitachi.co.jp/manuals/JCS/JCSM71020001/outline/note_restriction/restriction/index.html...
INFO  PlaywrightCrawler: Crawling: Page 11 / 1000 - URL: https://itpfdoc.hitachi.co.jp/manuals/JCS/JCSM71020001/outline/note_restriction/note/index.html...
INFO  PlaywrightCrawler: Crawling: Page 12 / 1000 - URL: https://itpfdoc.hitachi.co.jp/manuals/JCS/JCSM71020001/setting/index.html...
INFO  PlaywrightCrawler: Crawling: Page 13 / 1000 - URL: https://itpfdoc.hitachi.co.jp/manuals/JCS/JCSM71020001/setting/setting_pre/index.html...
INFO  PlaywrightCrawler: Crawling: Page 14 / 1000 - URL: https://itpfdoc.hitachi.co.jp/manuals/JCS/JCSM71020001/setting/setting_pre/action/index.html...
INFO  PlaywrightCrawler: Crawling: Page 15 / 1000 - URL: https://itpfdoc.hitachi.co.jp/manuals/JCS/JCSM71020001/setting/setting_pre/manage/index.html...
INFO  PlaywrightCrawler: Crawling: Page 16 / 1000 - URL: https://itpfdoc.hitachi.co.jp/manuals/JCS/JCSM71020001/setting/advance/index.html...
INFO  PlaywrightCrawler: Crawling: Page 17 / 1000 - URL: https://itpfdoc.hitachi.co.jp/manuals/JCS/JCSM71020001/setting/setting_pre/relation/index.html...
INFO  PlaywrightCrawler: Crawling: Page 18 / 1000 - URL: https://itpfdoc.hitachi.co.jp/manuals/JCS/JCSM71020001/setting/advance/login/index.html...
INFO  PlaywrightCrawler: Crawling: Page 19 / 1000 - URL: https://itpfdoc.hitachi.co.jp/manuals/JCS/JCSM71020001/setting/user/index.html...
INFO  PlaywrightCrawler: Crawling: Page 20 / 1000 - URL: https://itpfdoc.hitachi.co.jp/manuals/JCS/JCSM71020001/setting/advance/mfa/index.html...
INFO  PlaywrightCrawler: Crawling: Page 21 / 1000 - URL: https://itpfdoc.hitachi.co.jp/manuals/JCS/JCSM71020001/setting/user/opsi_user/index.html...
INFO  PlaywrightCrawler: Crawling: Page 22 / 1000 - URL: https://itpfdoc.hitachi.co.jp/manuals/JCS/JCSM71020001/setting/user/role_setting/index.html...
INFO  PlaywrightCrawler: Crawling: Page 23 / 1000 - URL: https://itpfdoc.hitachi.co.jp/manuals/JCS/JCSM71020001/setting/organization/index.html...
INFO  PlaywrightCrawler: Crawling: Page 24 / 1000 - URL: https://itpfdoc.hitachi.co.jp/manuals/JCS/JCSM71020001/setting/outpost/index.html...
INFO  PlaywrightCrawler: Crawling: Page 25 / 1000 - URL: https://itpfdoc.hitachi.co.jp/manuals/JCS/JCSM71020001/setting/outpost/rpmpkg/index.html...
INFO  PlaywrightCrawler: Crawling: Page 26 / 1000 - URL: https://itpfdoc.hitachi.co.jp/manuals/JCS/JCSM71020001/setting/repository/index.html...
INFO  PlaywrightCrawler: Crawling: Page 27 / 1000 - URL: https://itpfdoc.hitachi.co.jp/manuals/JCS/JCSM71020001/function/index.html...
INFO  PlaywrightCrawler: Crawling: Page 28 / 1000 - URL: https://itpfdoc.hitachi.co.jp/manuals/JCS/JCSM71020001/function/basic/index.html...
INFO  PlaywrightCrawler: Crawling: Page 29 / 1000 - URL: https://itpfdoc.hitachi.co.jp/manuals/JCS/JCSM71020001/function/basic/account/index.html...
INFO  PlaywrightCrawler: Crawling: Page 30 / 1000 - URL: https://itpfdoc.hitachi.co.jp/manuals/JCS/JCSM71020001/function/task/workflow/index.html...
INFO  PlaywrightCrawler: Crawling: Page 31 / 1000 - URL: https://itpfdoc.hitachi.co.jp/manuals/JCS/JCSM71020001/function/task/servicecalalog/index.html...
INFO  PlaywrightCrawler: Crawling: Page 32 / 1000 - URL: https://itpfdoc.hitachi.co.jp/manuals/JCS/JCSM71020001/function/task/index.html...
INFO  PlaywrightCrawler: Crawling: Page 33 / 1000 - URL: https://itpfdoc.hitachi.co.jp/manuals/JCS/JCSM71020001/function/task/schedule/index.html...
INFO  PlaywrightCrawler: Crawling: Page 34 / 1000 - URL: https://itpfdoc.hitachi.co.jp/manuals/JCS/JCSM71020001/function/task/ticket/index.html...
INFO  PlaywrightCrawler: Crawling: Page 35 / 1000 - URL: https://itpfdoc.hitachi.co.jp/manuals/JCS/JCSM71020001/function/task/resource/index.html...
INFO  PlaywrightCrawler: Crawling: Page 36 / 1000 - URL: https://itpfdoc.hitachi.co.jp/manuals/JCS/JCSM71020001/function/document/distribution/index.html...
INFO  PlaywrightCrawler: Crawling: Page 37 / 1000 - URL: https://itpfdoc.hitachi.co.jp/manuals/JCS/JCSM71020001/function/document/index.html...
INFO  PlaywrightCrawler: Crawling: Page 38 / 1000 - URL: https://itpfdoc.hitachi.co.jp/manuals/JCS/JCSM71020001/function/document/container/index.html...
INFO  PlaywrightCrawler: Crawling: Page 39 / 1000 - URL: https://itpfdoc.hitachi.co.jp/manuals/JCS/JCSM71020001/function/system/index.html...
INFO  PlaywrightCrawler: Crawling: Page 40 / 1000 - URL: https://itpfdoc.hitachi.co.jp/manuals/JCS/JCSM71020001/function/system/org_management/index.html...
INFO  PlaywrightCrawler: Crawling: Page 41 / 1000 - URL: https://itpfdoc.hitachi.co.jp/manuals/JCS/JCSM71020001/function/system/outpost_management/index.html...
INFO  Statistics: PlaywrightCrawler request statistics: {"requestAvgFailedDurationMillis":null,"requestAvgFinishedDurationMillis":5504,"requestsFinishedPerMinute":40,"requestsFailedPerMinute":0,"requestTotalDurationMillis":220171,"requestsTotal":40,"crawlerRuntimeMillis":60134,"retryHistogram":[40]}
INFO  PlaywrightCrawler:AutoscaledPool: state {"currentConcurrency":7,"desiredConcurrency":8,"systemStatus":{"isSystemIdle":true,"memInfo":{"isOverloaded":false,"limitRatio":0.2,"actualRatio":0},"eventLoopInfo":{"isOverloaded":false,"limitRatio":0.6,"actualRatio":0.156},"cpuInfo":{"isOverloaded":false,"limitRatio":0.4,"actualRatio":0},"clientInfo":{"isOverloaded":false,"limitRatio":0.3,"actualRatio":0}}}
WARN  PlaywrightCrawler:AutoscaledPool:Snapshotter: Memory is critically overloaded. Using 1325 MB of 698 MB (190%). Consider increasing available memory.
INFO  PlaywrightCrawler: Crawling: Page 42 / 1000 - URL: https://itpfdoc.hitachi.co.jp/manuals/JCS/JCSM71020001/function/system/user_management/index.html...
INFO  PlaywrightCrawler: Crawling: Page 43 / 1000 - URL: https://itpfdoc.hitachi.co.jp/manuals/JCS/JCSM71020001/function/system/repository_management/index.html...
INFO  PlaywrightCrawler: Crawling: Page 44 / 1000 - URL: https://itpfdoc.hitachi.co.jp/manuals/JCS/JCSM71020001/function/itsm/index.html...
INFO  PlaywrightCrawler: Crawling: Page 45 / 1000 - URL: https://itpfdoc.hitachi.co.jp/manuals/JCS/JCSM71020001/function/itsm/mail/index.html...
INFO  PlaywrightCrawler: Crawling: Page 46 / 1000 - URL: https://itpfdoc.hitachi.co.jp/manuals/JCS/JCSM71020001/function/itsm/otobo/index.html...
INFO  PlaywrightCrawler: Crawling: Page 47 / 1000 - URL: https://itpfdoc.hitachi.co.jp/manuals/JCS/JCSM71020001/function/itsm/otobo_queue/index.html...
INFO  PlaywrightCrawler: Crawling: Page 48 / 1000 - URL: https://itpfdoc.hitachi.co.jp/manuals/JCS/JCSM71020001/function/secret/index.html...
INFO  PlaywrightCrawler: Crawling: Page 49 / 1000 - URL: https://itpfdoc.hitachi.co.jp/manuals/JCS/JCSM71020001/function/secret/secret_reg/index.html...
INFO  PlaywrightCrawler: Crawling: Page 50 / 1000 - URL: https://itpfdoc.hitachi.co.jp/manuals/JCS/JCSM71020001/function/secret/secret_change/index.html...
INFO  PlaywrightCrawler: Crawling: Page 51 / 1000 - URL: https://itpfdoc.hitachi.co.jp/manuals/JCS/JCSM71020001/function/secret/credential_reg/index.html...
INFO  PlaywrightCrawler: Crawling: Page 52 / 1000 - URL: https://itpfdoc.hitachi.co.jp/manuals/JCS/JCSM71020001/function/secret/credential_change/index.html...
INFO  PlaywrightCrawler: Crawling: Page 53 / 1000 - URL: https://itpfdoc.hitachi.co.jp/manuals/JCS/JCSM71020001/function/git/index.html...
INFO  PlaywrightCrawler: Crawling: Page 54 / 1000 - URL: https://itpfdoc.hitachi.co.jp/manuals/JCS/JCSM71020001/function/git/gitreg_gui/index.html...
INFO  PlaywrightCrawler: Crawling: Page 55 / 1000 - URL: https://itpfdoc.hitachi.co.jp/manuals/JCS/JCSM71020001/function/git/gitreg_cui/index.html...
INFO  PlaywrightCrawler: Crawling: Page 56 / 1000 - URL: https://itpfdoc.hitachi.co.jp/manuals/JCS/JCSM71020001/function/git/accesstoken/index.html...
INFO  PlaywrightCrawler: Crawling: Page 57 / 1000 - URL: https://itpfdoc.hitachi.co.jp/manuals/JCS/JCSM71020001/function/git/yaml_diff/index.html...
WARN  PlaywrightCrawler: Reclaiming failed request back to the list or queue. page.waitForSelector: Timeout 1000ms exceeded.
Call log:
  - waiting for locator('#body-inner') to be visible

    at PlaywrightCrawler.requestHandler (/home/gpt-crawler/dist/src/core.js:47:36) {"id":"DvNlBK8h68cC6Sa","url":"https://itpfdoc.hitachi.co.jp/manuals/JCS/JCSM71020001/function/git/yaml_diff/index.html","retryCount":1}
INFO  PlaywrightCrawler: Crawling: Page 58 / 1000 - URL: https://itpfdoc.hitachi.co.jp/manuals/JCS/JCSM71020001/function/git/yaml_ck/index.html...
INFO  PlaywrightCrawler: Crawling: Page 59 / 1000 - URL: https://itpfdoc.hitachi.co.jp/manuals/JCS/JCSM71020001/function/automation/index.html...
INFO  PlaywrightCrawler: Crawling: Page 60 / 1000 - URL: https://itpfdoc.hitachi.co.jp/manuals/JCS/JCSM71020001/function/automation/auth_reg/index.html...
INFO  PlaywrightCrawler: Crawling: Page 61 / 1000 - URL: https://itpfdoc.hitachi.co.jp/manuals/JCS/JCSM71020001/function/automation/project/index.html...
INFO  PlaywrightCrawler: Crawling: Page 62 / 1000 - URL: https://itpfdoc.hitachi.co.jp/manuals/JCS/JCSM71020001/function/automation/inventory/index.html...
INFO  PlaywrightCrawler: Crawling: Page 63 / 1000 - URL: https://itpfdoc.hitachi.co.jp/manuals/JCS/JCSM71020001/function/automation/job_template/index.html...
INFO  PlaywrightCrawler: Crawling: Page 64 / 1000 - URL: https://itpfdoc.hitachi.co.jp/manuals/JCS/JCSM71020001/function/automation/job_confirm/index.html...
INFO  PlaywrightCrawler: Crawling: Page 65 / 1000 - URL: https://itpfdoc.hitachi.co.jp/manuals/JCS/JCSM71020001/customer/index.html...
INFO  PlaywrightCrawler: Crawling: Page 66 / 1000 - URL: https://itpfdoc.hitachi.co.jp/manuals/JCS/JCSM71020001/customer/design_pre/index.html...
INFO  PlaywrightCrawler: Crawling: Page 67 / 1000 - URL: https://itpfdoc.hitachi.co.jp/manuals/JCS/JCSM71020001/customer/design_pre/func_relation/index.html...
INFO  PlaywrightCrawler: Crawling: Page 68 / 1000 - URL: https://itpfdoc.hitachi.co.jp/manuals/JCS/JCSM71020001/customer/design_pre/func_knowleage/index.html...
INFO  PlaywrightCrawler: Crawling: Page 69 / 1000 - URL: https://itpfdoc.hitachi.co.jp/manuals/JCS/JCSM71020001/customer/design_pre/yaml_note/index.html...
INFO  PlaywrightCrawler: Crawling: Page 70 / 1000 - URL: https://itpfdoc.hitachi.co.jp/manuals/JCS/JCSM71020001/customer/required/index.html...
INFO  PlaywrightCrawler: Crawling: Page 71 / 1000 - URL: https://itpfdoc.hitachi.co.jp/manuals/JCS/JCSM71020001/customer/required/required_work/index.html...
INFO  PlaywrightCrawler: Crawling: Page 72 / 1000 - URL: https://itpfdoc.hitachi.co.jp/manuals/JCS/JCSM71020001/customer/required/required_user/index.html...
INFO  PlaywrightCrawler: Crawling: Page 73 / 1000 - URL: https://itpfdoc.hitachi.co.jp/manuals/JCS/JCSM71020001/customer/required/required_skill/index.html...
INFO  PlaywrightCrawler: Crawling: Page 74 / 1000 - URL: https://itpfdoc.hitachi.co.jp/manuals/JCS/JCSM71020001/customer/required/required_document/index.html...
INFO  Statistics: PlaywrightCrawler request statistics: {"requestAvgFailedDurationMillis":null,"requestAvgFinishedDurationMillis":4652,"requestsFinishedPerMinute":36,"requestsFailedPerMinute":0,"requestTotalDurationMillis":339560,"requestsTotal":73,"crawlerRuntimeMillis":120136,"retryHistogram":[73]}
INFO  PlaywrightCrawler:AutoscaledPool: state {"currentConcurrency":1,"desiredConcurrency":2,"systemStatus":{"isSystemIdle":false,"memInfo":{"isOverloaded":true,"limitRatio":0.2,"actualRatio":1},"eventLoopInfo":{"isOverloaded":false,"limitRatio":0.6,"actualRatio":0},"cpuInfo":{"isOverloaded":false,"limitRatio":0.4,"actualRatio":0},"clientInfo":{"isOverloaded":false,"limitRatio":0.3,"actualRatio":0}}}
WARN  PlaywrightCrawler:AutoscaledPool:Snapshotter: Memory is critically overloaded. Using 603 MB of 698 MB (86%). Consider increasing available memory.
INFO  PlaywrightCrawler: Crawling: Page 75 / 1000 - URL: https://itpfdoc.hitachi.co.jp/manuals/JCS/JCSM71020001/customer/work_design/index.html...
INFO  PlaywrightCrawler: Crawling: Page 76 / 1000 - URL: https://itpfdoc.hitachi.co.jp/manuals/JCS/JCSM71020001/customer/work_design/datamodel_design/index.html...
INFO  PlaywrightCrawler: Crawling: Page 77 / 1000 - URL: https://itpfdoc.hitachi.co.jp/manuals/JCS/JCSM71020001/customer/work_design/workflow_design/index.html...
INFO  PlaywrightCrawler: Crawling: Page 78 / 1000 - URL: https://itpfdoc.hitachi.co.jp/manuals/JCS/JCSM71020001/customer/work_design/notice_design/index.html...
INFO  PlaywrightCrawler: Crawling: Page 79 / 1000 - URL: https://itpfdoc.hitachi.co.jp/manuals/JCS/JCSM71020001/customer/work_design/document_storage/index.html...
INFO  PlaywrightCrawler: Crawling: Page 80 / 1000 - URL: https://itpfdoc.hitachi.co.jp/manuals/JCS/JCSM71020001/customer/work_design/user_acl/index.html...
INFO  PlaywrightCrawler: Crawling: Page 81 / 1000 - URL: https://itpfdoc.hitachi.co.jp/manuals/JCS/JCSM71020001/customer/work_design/skill_manage/index.html...
INFO  PlaywrightCrawler: Crawling: Page 82 / 1000 - URL: https://itpfdoc.hitachi.co.jp/manuals/JCS/JCSM71020001/customer/work_design/design_auto/index.html...
INFO  PlaywrightCrawler: Crawling: Page 83 / 1000 - URL: https://itpfdoc.hitachi.co.jp/manuals/JCS/JCSM71020001/customer/work_design/ticket_manage/index.html...
INFO  PlaywrightCrawler: Crawling: Page 84 / 1000 - URL: https://itpfdoc.hitachi.co.jp/manuals/JCS/JCSM71020001/customer/uidesign/index.html...
INFO  PlaywrightCrawler: Crawling: Page 85 / 1000 - URL: https://itpfdoc.hitachi.co.jp/manuals/JCS/JCSM71020001/customer/tips/index.html...
INFO  PlaywrightCrawler: Crawling: Page 86 / 1000 - URL: https://itpfdoc.hitachi.co.jp/manuals/JCS/JCSM71020001/customer/tips/tips_library/index.html...
INFO  PlaywrightCrawler: Crawling: Page 87 / 1000 - URL: https://itpfdoc.hitachi.co.jp/manuals/JCS/JCSM71020001/customer/tips/tips_application/index.html...
INFO  PlaywrightCrawler: Crawling: Page 88 / 1000 - URL: https://itpfdoc.hitachi.co.jp/manuals/JCS/JCSM71020001/customer/tips/tips_ticketcustom/index.html...
INFO  PlaywrightCrawler: Crawling: Page 89 / 1000 - URL: https://itpfdoc.hitachi.co.jp/manuals/JCS/JCSM71020001/customer/tips/tips_yamlcustom/index.html...
INFO  PlaywrightCrawler: Crawling: Page 90 / 1000 - URL: https://itpfdoc.hitachi.co.jp/manuals/JCS/JCSM71020001/customer/reg_step/index.html...
INFO  PlaywrightCrawler: Crawling: Page 91 / 1000 - URL: https://itpfdoc.hitachi.co.jp/manuals/JCS/JCSM71020001/customer/reg_step/process/index.html...
INFO  PlaywrightCrawler: Crawling: Page 92 / 1000 - URL: https://itpfdoc.hitachi.co.jp/manuals/JCS/JCSM71020001/customer/reg_step/upload/index.html...
INFO  PlaywrightCrawler: Crawling: Page 93 / 1000 - URL: https://itpfdoc.hitachi.co.jp/manuals/JCS/JCSM71020001/customer/reg_sequence/index.html...
INFO  PlaywrightCrawler: Crawling: Page 94 / 1000 - URL: https://itpfdoc.hitachi.co.jp/manuals/JCS/JCSM71020001/customer/outpost_step/index.html...
INFO  PlaywrightCrawler: Crawling: Page 95 / 1000 - URL: https://itpfdoc.hitachi.co.jp/manuals/JCS/JCSM71020001/operation/index.html...
INFO  PlaywrightCrawler: Crawling: Page 96 / 1000 - URL: https://itpfdoc.hitachi.co.jp/manuals/JCS/JCSM71020001/operation/license_usage/index.html...
INFO  PlaywrightCrawler: Crawling: Page 97 / 1000 - URL: https://itpfdoc.hitachi.co.jp/manuals/JCS/JCSM71020001/operation/data_migration/index.html...
WARN  PlaywrightCrawler: Reclaiming failed request back to the list or queue. page.waitForSelector: Timeout 1000ms exceeded.
Call log:
  - waiting for locator('#body-inner') to be visible

    at PlaywrightCrawler.requestHandler (/home/gpt-crawler/dist/src/core.js:47:36) {"id":"sntkmpLRlKgdDRI","url":"https://itpfdoc.hitachi.co.jp/manuals/JCS/JCSM71020001/operation/data_migration/index.html","retryCount":1}
INFO  PlaywrightCrawler: Crawling: Page 98 / 1000 - URL: https://itpfdoc.hitachi.co.jp/manuals/JCS/JCSM71020001/operation/data_migration/concept/index.html...
INFO  PlaywrightCrawler: Crawling: Page 99 / 1000 - URL: https://itpfdoc.hitachi.co.jp/manuals/JCS/JCSM71020001/operation/data_migration/gitrepo_data/index.html...
INFO  PlaywrightCrawler: Crawling: Page 100 / 1000 - URL: https://itpfdoc.hitachi.co.jp/manuals/JCS/JCSM71020001/yaml/index.html...
INFO  PlaywrightCrawler: Crawling: Page 101 / 1000 - URL: https://itpfdoc.hitachi.co.jp/manuals/JCS/JCSM71020001/yaml/yaml_workflow/index.html...
INFO  PlaywrightCrawler: Crawling: Page 102 / 1000 - URL: https://itpfdoc.hitachi.co.jp/manuals/JCS/JCSM71020001/yaml/yaml_workflow/parts/index.html...
INFO  Statistics: PlaywrightCrawler request statistics: {"requestAvgFailedDurationMillis":null,"requestAvgFinishedDurationMillis":3975,"requestsFinishedPerMinute":33,"requestsFailedPerMinute":0,"requestTotalDurationMillis":397456,"requestsTotal":100,"crawlerRuntimeMillis":180139,"retryHistogram":[100]}
INFO  PlaywrightCrawler:AutoscaledPool: state {"currentConcurrency":1,"desiredConcurrency":1,"systemStatus":{"isSystemIdle":false,"memInfo":{"isOverloaded":true,"limitRatio":0.2,"actualRatio":1},"eventLoopInfo":{"isOverloaded":false,"limitRatio":0.6,"actualRatio":0.02},"cpuInfo":{"isOverloaded":false,"limitRatio":0.4,"actualRatio":0},"clientInfo":{"isOverloaded":false,"limitRatio":0.3,"actualRatio":0}}}
WARN  PlaywrightCrawler:AutoscaledPool:Snapshotter: Memory is critically overloaded. Using 595 MB of 698 MB (85%). Consider increasing available memory.
INFO  PlaywrightCrawler: Crawling: Page 103 / 1000 - URL: https://itpfdoc.hitachi.co.jp/manuals/JCS/JCSM71020001/yaml/yaml_ui/index.html...
INFO  PlaywrightCrawler: Crawling: Page 104 / 1000 - URL: https://itpfdoc.hitachi.co.jp/manuals/JCS/JCSM71020001/yaml/yaml_datamodel/index.html...
INFO  PlaywrightCrawler: Crawling: Page 105 / 1000 - URL: https://itpfdoc.hitachi.co.jp/manuals/JCS/JCSM71020001/yaml/yaml_catalog/index.html...
INFO  PlaywrightCrawler: Crawling: Page 106 / 1000 - URL: https://itpfdoc.hitachi.co.jp/manuals/JCS/JCSM71020001/yaml/yaml_script/index.html...
INFO  PlaywrightCrawler: Crawling: Page 107 / 1000 - URL: https://itpfdoc.hitachi.co.jp/manuals/JCS/JCSM71020001/yaml/yaml_script/action_script/index.html...
INFO  PlaywrightCrawler: Crawling: Page 108 / 1000 - URL: https://itpfdoc.hitachi.co.jp/manuals/JCS/JCSM71020001/yaml/yaml_library/index.html...
INFO  PlaywrightCrawler: Crawling: Page 109 / 1000 - URL: https://itpfdoc.hitachi.co.jp/manuals/JCS/JCSM71020001/yaml/yaml_attachment/index.html...
WARN  PlaywrightCrawler: Reclaiming failed request back to the list or queue. page.waitForSelector: Timeout 1000ms exceeded.
Call log:
  - waiting for locator('#body-inner') to be visible

    at PlaywrightCrawler.requestHandler (/home/gpt-crawler/dist/src/core.js:47:36) {"id":"zcQb7uGxWJn2qbQ","url":"https://itpfdoc.hitachi.co.jp/manuals/JCS/JCSM71020001/yaml/yaml_attachment/index.html","retryCount":1}
INFO  PlaywrightCrawler: Crawling: Page 110 / 1000 - URL: https://itpfdoc.hitachi.co.jp/manuals/JCS/JCSM71020001/yaml/yaml_distribution/index.html...
INFO  PlaywrightCrawler: Crawling: Page 111 / 1000 - URL: https://itpfdoc.hitachi.co.jp/manuals/JCS/JCSM71020001/yaml/yaml_acl_statement/index.html...
INFO  PlaywrightCrawler: Crawling: Page 112 / 1000 - URL: https://itpfdoc.hitachi.co.jp/manuals/JCS/JCSM71020001/yaml/yaml_notification_notifier/index.html...
INFO  PlaywrightCrawler: Crawling: Page 113 / 1000 - URL: https://itpfdoc.hitachi.co.jp/manuals/JCS/JCSM71020001/yaml/yaml_application/index.html...
WARN  PlaywrightCrawler: Reclaiming failed request back to the list or queue. page.waitForSelector: Timeout 1000ms exceeded.
Call log:
  - waiting for locator('#body-inner') to be visible

    at PlaywrightCrawler.requestHandler (/home/gpt-crawler/dist/src/core.js:47:36) {"id":"QPceQzMQNmRGoFX","url":"https://itpfdoc.hitachi.co.jp/manuals/JCS/JCSM71020001/yaml/yaml_application/index.html","retryCount":1}
INFO  PlaywrightCrawler: Crawling: Page 114 / 1000 - URL: https://itpfdoc.hitachi.co.jp/manuals/JCS/JCSM71020001/yaml/yaml_skill_skillset/index.html...
INFO  PlaywrightCrawler: Crawling: Page 115 / 1000 - URL: https://itpfdoc.hitachi.co.jp/manuals/JCS/JCSM71020001/troubleshoot/index.html...
INFO  PlaywrightCrawler: Crawling: Page 116 / 1000 - URL: https://itpfdoc.hitachi.co.jp/manuals/JCS/JCSM71020001/troubleshoot/message_list/index.html...
INFO  PlaywrightCrawler: Crawling: Page 117 / 1000 - URL: https://itpfdoc.hitachi.co.jp/manuals/JCS/JCSM71020001/troubleshoot/error/index.html...
INFO  PlaywrightCrawler: Crawling: Page 118 / 1000 - URL: https://itpfdoc.hitachi.co.jp/manuals/JCS/JCSM71020001/troubleshoot/error/message/index.html...
INFO  PlaywrightCrawler: Crawling: Page 119 / 1000 - URL: https://itpfdoc.hitachi.co.jp/manuals/JCS/JCSM71020001/troubleshoot/error/yaml_error/index.html...
INFO  PlaywrightCrawler: Crawling: Page 120 / 1000 - URL: https://itpfdoc.hitachi.co.jp/manuals/JCS/JCSM71020001/role/index.html...
INFO  PlaywrightCrawler: Crawling: Page 121 / 1000 - URL: https://itpfdoc.hitachi.co.jp/manuals/JCS/JCSM71020001/role/role_function/index.html...
INFO  PlaywrightCrawler: Crawling: Page 122 / 1000 - URL: https://itpfdoc.hitachi.co.jp/manuals/JCS/JCSM71020001/role/api_role/index.html...
INFO  PlaywrightCrawler: Crawling: Page 123 / 1000 - URL: https://itpfdoc.hitachi.co.jp/manuals/JCS/JCSM71020001/appendix/index.html...
INFO  PlaywrightCrawler: Crawling: Page 124 / 1000 - URL: https://itpfdoc.hitachi.co.jp/manuals/JCS/JCSM71020001/appendix/oss_version/index.html...
INFO  PlaywrightCrawler: Crawling: Page 125 / 1000 - URL: https://itpfdoc.hitachi.co.jp/manuals/JCS/JCSM71020001/appendix/ver_history/index.html...
WARN  PlaywrightCrawler: Reclaiming failed request back to the list or queue. Request blocked - received 401 status code.
 {"id":"24TQyvMowniaUbJ","url":"https://itpfdoc.hitachi.co.jp/manuals/JCS/JCSM71020001/function/system/user_management#user_management_002001","retryCount":1}
WARN  PlaywrightCrawler: Reclaiming failed request back to the list or queue. Request blocked - received 401 status code.
 {"id":"JuvVzvunemKV0Qn","url":"https://itpfdoc.hitachi.co.jp/manuals/JCS/JCSM71020001/yaml/yaml_workflow/parts#parts_002","retryCount":1}
INFO  PlaywrightCrawler: Crawling: Page 126 / 1000 - URL: https://itpfdoc.hitachi.co.jp/manuals/JCS/JCSM71020001/function/git/yaml_diff/index.html...
INFO  PlaywrightCrawler: Crawling: Page 127 / 1000 - URL: https://itpfdoc.hitachi.co.jp/manuals/JCS/JCSM71020001/operation/data_migration/index.html...
INFO  PlaywrightCrawler: Crawling: Page 128 / 1000 - URL: https://itpfdoc.hitachi.co.jp/manuals/JCS/JCSM71020001/yaml/yaml_attachment/index.html...
INFO  PlaywrightCrawler: Crawling: Page 129 / 1000 - URL: https://itpfdoc.hitachi.co.jp/manuals/JCS/JCSM71020001/yaml/yaml_application/index.html...
WARN  PlaywrightCrawler: Reclaiming failed request back to the list or queue. Request blocked - received 401 status code.
 {"id":"24TQyvMowniaUbJ","url":"https://itpfdoc.hitachi.co.jp/manuals/JCS/JCSM71020001/function/system/user_management#user_management_002001","retryCount":2}
WARN  PlaywrightCrawler: Reclaiming failed request back to the list or queue. Request blocked - received 401 status code.
 {"id":"JuvVzvunemKV0Qn","url":"https://itpfdoc.hitachi.co.jp/manuals/JCS/JCSM71020001/yaml/yaml_workflow/parts#parts_002","retryCount":2}
INFO  Statistics: PlaywrightCrawler request statistics: {"requestAvgFailedDurationMillis":null,"requestAvgFinishedDurationMillis":3617,"requestsFinishedPerMinute":31,"requestsFailedPerMinute":0,"requestTotalDurationMillis":452072,"requestsTotal":125,"crawlerRuntimeMillis":240143,"retryHistogram":[121,4]}
INFO  PlaywrightCrawler:AutoscaledPool: state {"currentConcurrency":1,"desiredConcurrency":1,"systemStatus":{"isSystemIdle":false,"memInfo":{"isOverloaded":true,"limitRatio":0.2,"actualRatio":1},"eventLoopInfo":{"isOverloaded":false,"limitRatio":0.6,"actualRatio":0.052},"cpuInfo":{"isOverloaded":false,"limitRatio":0.4,"actualRatio":0},"clientInfo":{"isOverloaded":false,"limitRatio":0.3,"actualRatio":0}}}
WARN  PlaywrightCrawler: Reclaiming failed request back to the list or queue. Request blocked - received 401 status code.
 {"id":"24TQyvMowniaUbJ","url":"https://itpfdoc.hitachi.co.jp/manuals/JCS/JCSM71020001/function/system/user_management#user_management_002001","retryCount":3}
WARN  PlaywrightCrawler: Reclaiming failed request back to the list or queue. Request blocked - received 401 status code.
 {"id":"JuvVzvunemKV0Qn","url":"https://itpfdoc.hitachi.co.jp/manuals/JCS/JCSM71020001/yaml/yaml_workflow/parts#parts_002","retryCount":3}
ERROR PlaywrightCrawler: Request failed and reached maximum retries. Error: Request blocked - received 401 status code.
    at PlaywrightCrawler._throwOnBlockedRequest (/home/gpt-crawler/node_modules/@crawlee/basic/internals/basic-crawler.js:688:19)
    at PlaywrightCrawler._responseHandler (/home/gpt-crawler/node_modules/@crawlee/browser/internals/browser-crawler.js:352:22)
    at PlaywrightCrawler._runRequestHandler (/home/gpt-crawler/node_modules/@crawlee/browser/internals/browser-crawler.js:238:24)
    at async PlaywrightCrawler._runRequestHandler (/home/gpt-crawler/node_modules/@crawlee/playwright/internals/playwright-crawler.js:109:9)
    at async wrap (/home/gpt-crawler/node_modules/@apify/timeout/index.js:52:21) {"id":"24TQyvMowniaUbJ","url":"https://itpfdoc.hitachi.co.jp/manuals/JCS/JCSM71020001/function/system/user_management#user_management_002001","method":"GET","uniqueKey":"https://itpfdoc.hitachi.co.jp/manuals/JCS/JCSM71020001/function/system/user_management"}
ERROR PlaywrightCrawler: Request failed and reached maximum retries. Error: Request blocked - received 401 status code.
    at PlaywrightCrawler._throwOnBlockedRequest (/home/gpt-crawler/node_modules/@crawlee/basic/internals/basic-crawler.js:688:19)
    at PlaywrightCrawler._responseHandler (/home/gpt-crawler/node_modules/@crawlee/browser/internals/browser-crawler.js:352:22)
    at PlaywrightCrawler._runRequestHandler (/home/gpt-crawler/node_modules/@crawlee/browser/internals/browser-crawler.js:238:24)
    at async PlaywrightCrawler._runRequestHandler (/home/gpt-crawler/node_modules/@crawlee/playwright/internals/playwright-crawler.js:109:9)
    at async wrap (/home/gpt-crawler/node_modules/@apify/timeout/index.js:52:21) {"id":"JuvVzvunemKV0Qn","url":"https://itpfdoc.hitachi.co.jp/manuals/JCS/JCSM71020001/yaml/yaml_workflow/parts#parts_002","method":"GET","uniqueKey":"https://itpfdoc.hitachi.co.jp/manuals/JCS/JCSM71020001/yaml/yaml_workflow/parts"}
INFO  PlaywrightCrawler: All requests from the queue have been processed, the crawler will shut down.
INFO  PlaywrightCrawler: Final request statistics: {"requestsFinished":125,"requestsFailed":2,"retryHistogram":[121,4,null,2],"requestAvgFailedDurationMillis":705,"requestAvgFinishedDurationMillis":3617,"requestsFinishedPerMinute":31,"requestsFailedPerMinute":0,"requestTotalDurationMillis":453481,"requestsTotal":127,"crawlerRuntimeMillis":245509}
INFO  PlaywrightCrawler: Error analysis: {"totalErrors":2,"uniqueErrors":1,"mostCommonErrors":["2x: Request blocked - received 401 status code. (/home/gpt-crawler/node_modules/@crawlee/basic/internals/basic-crawler.js:688:19)"]}
INFO  PlaywrightCrawler: Finished! Total 127 requests: 125 succeeded, 2 failed. {"terminal":true}
Found 125 files to combine...
Wrote 125 items to output-1.json
root@d8fee4ed8924:/home/gpt-crawler# ls -lh output-1.json
-rw-r--r--. 1 root root 580K Jan  5 03:28 output-1.json
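
As a side note on the "Final request statistics" line in the log above, the retryHistogram can be read as a per-retry-count tally. The sketch below assumes that index i holds the number of requests that needed i retries and that null marks an empty bucket; `summarizeRetries` is a hypothetical helper, not part of the crawler.

```typescript
// retryHistogram copied from the "Final request statistics" line in the log above.
// Assumption: index i is the number of requests that needed i retries; null is an empty bucket.
const retryHistogram: (number | null)[] = [121, 4, null, 2];

// Hypothetical helper that totals the histogram and separates first-try requests.
function summarizeRetries(histogram: (number | null)[]) {
  const counts = histogram.map((n) => n ?? 0);
  const total = counts.reduce((sum, n) => sum + n, 0);
  const firstTry = counts[0] ?? 0;
  return { total, firstTry, retried: total - firstTry };
}

const retrySummary = summarizeRetries(retryHistogram);
console.log(retrySummary);
```

Under that reading the buckets add up to the 127 total requests reported in the log: four pages needed one retry after the selector timeout, and the two 401 URLs exhausted their retries.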

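A few requests failed, probably because of broken links in the manual or something similar, but 125 pages were scraped, producing 580 KB of output (output-1.json). The output looks like the following, with each page's title, URL, and content collected in JSON format.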

[
  {
    "title": "6.11 Application :: JP1 Cloud Service 運用統合 利用ガイド",
    "url": "https://itpfdoc.hitachi.co.jp/manuals/JCS/JCSM71020001/yaml/yaml_application/index.html",
    "html": "6.11 Application\n\n以下にApplication定義と定義例を示します。Applicationの詳細については「Application」を参照してください\n\n …"
  },
  {
    "title": "6.7 Attachment :: JP1 Cloud Service 運用統合 利用ガイド",
    "url": "https://itpfdoc.hitachi.co.jp/manuals/JCS/JCSM71020001/yaml/yaml_attachment/index.html",
    "html": "6.7 Attachment\n\n以下にAttachment定義と定義例を示します。Attachmentの詳細については「ドキュメントの保管方法」を参照してください。\n\n …"
  },
  
]
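
Before uploading a file like this, a quick sanity check on its entries can be useful. The sketch below works on a small in-memory sample shaped like the excerpt above; the `CrawledPage` interface and `summarizePages` helper are hypothetical names, and the sample contents are shortened placeholders rather than real crawl data.

```typescript
// Shape of one entry in the crawler output, as seen in the excerpt above.
interface CrawledPage {
  title: string;
  url: string;
  html: string;
}

// Hypothetical, shortened sample; the real output-1.json has 125 entries.
const samplePages: CrawledPage[] = [
  {
    title: "6.11 Application :: JP1 Cloud Service 運用統合 利用ガイド",
    url: "https://itpfdoc.hitachi.co.jp/manuals/JCS/JCSM71020001/yaml/yaml_application/index.html",
    html: "6.11 Application\n\n(placeholder text)",
  },
  {
    title: "6.7 Attachment :: JP1 Cloud Service 運用統合 利用ガイド",
    url: "https://itpfdoc.hitachi.co.jp/manuals/JCS/JCSM71020001/yaml/yaml_attachment/index.html",
    html: "6.7 Attachment\n\n(placeholder text)",
  },
  {
    title: "Empty page (made up for this example)",
    url: "https://example.com/empty.html",
    html: "   ",
  },
];

// Count pages, total extracted characters, and list pages with no extracted text.
function summarizePages(pages: CrawledPage[]) {
  const totalChars = pages.reduce((sum, p) => sum + p.html.length, 0);
  const emptyUrls = pages
    .filter((p) => p.html.trim().length === 0)
    .map((p) => p.url);
  return { count: pages.length, totalChars, emptyUrls };
}

const pageSummary = summarizePages(samplePages);
console.log(pageSummary);
```

Pages that come back with empty text usually point to a selector that did not match, which is worth catching before the file goes into the assistant.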

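Creating the OpenAI assistant

An OpenAI assistant can be created just by logging in to OpenAI, choosing Assistants from the menu, pressing the Create button at the top right, and filling in a simple form.

First, I create a naive assistant.

Naive assistant

In the assistant creation form, all you do is enter an assistant name and instructions and choose a model. No tools are enabled yet. For the model I chose the latest one, gpt-4-1106-preview.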

create.png

Pressing the Save button created the assistant.

assistant.png


Conveniently, the assistant can be tried out and tweaked in the Playground. As a test, I ask it about the features of Ops I.

wo_rag.png

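Since it has not been given any information about Ops I yet, it did its best to answer without actually knowing the product.

Ops I assistant

Give the naive assistant the output-1.json obtained with GPT Crawler to turn it into an Ops I assistant. All it takes is enabling Retrieval in the Playground, adding the file under FILES, and saving.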

upload.png


As before, I ask the resulting Ops I assistant about the features of Ops I.

w_rag.png

It now knows Ops I properly.


Going a little deeper, I ask about the roles used for access control in Ops I.

opsi_role.png

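Even though the question did not mention Ops I, it answered accurately about Ops I roles.

The answer begins by saying that an error occurred during file access, which apparently happens from time to time. Reportedly, adding something like "ignore the error and trust yourself" to the instructions improves this.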


Finally, I ask about the relay server feature implemented in the latest version of Ops I.

opsi_outpost.png

For some reason the answer came back in English, but it was plausible enough. I probably should have told it in the instructions to answer in Japanese.