Let's Make CodiMD Searchable
Implementation plan
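- Fetch the note contents from pgsql
  - Query PostgreSQL to get each note's URL and content text
    - In the "Notes" table, content holds the note body and shortid is the ID used to build the URL
- Convert the text to JSON (nothing fancy) and send it to meilisearch
  - Quick and dirty for now: strip the line breaks from content before indexing
- For the URL ('url'), send baseurl + shortid
- Keep id as is
- Convert datetime.isoformat dates to Unix timestamps before sending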
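The plan above can be sketched as one pure function that turns a dumped row into a search document. This is only a minimal sketch under assumptions: `build_document`, `iso_to_timestamp`, `BASE_URL` and the sample row are hypothetical names and values made up for illustration, and it collapses all whitespace rather than only removing line breaks.

```python
import datetime

# Hypothetical base URL; substitute the address of your own CodiMD instance.
BASE_URL = "https://codimd.example.com/"


def iso_to_timestamp(value):
    """Convert an ISO 8601 string to a Unix timestamp; pass None through."""
    if value is None:
        return None
    return datetime.datetime.fromisoformat(value).timestamp()


def build_document(note, base_url=BASE_URL):
    """Turn one dumped "Notes" row (a dict) into a search document."""
    doc = dict(note)  # copy so the caller's row is not mutated
    # Collapse newlines, tabs and runs of spaces into single spaces.
    doc["content"] = " ".join((note.get("content") or "").split())
    doc["url"] = base_url + note["shortid"]
    for key in ("createdAt", "updatedAt", "lastchangeAt"):
        doc[key] = iso_to_timestamp(note.get(key))
    return doc


# Made-up sample row shaped like a dumped "Notes" record.
sample = {
    "id": "5197d13d-8145-4257-b7a4-996010c7512c",
    "shortid": "Qtw_cm7UW",
    "content": "# Shared Test\r\n\tfirst line\nsecond line",
    "createdAt": "2023-05-08T07:32:14+00:00",
    "updatedAt": "2023-05-15T07:58:42+00:00",
    "lastchangeAt": None,
}
doc = build_document(sample)
print(doc["url"], doc["content"])
```

Keeping the transformation free of database and network calls makes it easy to check against a handful of rows before anything is sent to the search server.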
Related scripts
Code that dumps the contents of the pgsql "Notes" table as JSON
# access pgsql and get data to json file
import psycopg2
from psycopg2.extras import DictCursor
import json

def default(o):
    if hasattr(o, "isoformat"):
        return o.isoformat()
    else:
        return str(o)

def dump_data():
    with psycopg2.connect("host=localhost port=5432 dbname=codimd user=codimd password=6lCd8ftriT") as db:
        with db.cursor(cursor_factory=DictCursor) as cur:
            cur.execute('SELECT * FROM "Notes"')
            results = cur.fetchall()
            dict_result = []
            for row in results:
                dict_result.append(dict(row))
    with open("./codimd.dump.json", "w", encoding="utf-8") as f:
        json.dump(dict_result, f, default=default, indent=4)

def load_data_from_json(file):
    with open(file, "r", encoding="utf-8") as f:
        data = json.load(f)
    return data
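To see what the `default` hook does during the dump, here is a small round trip through `json.dumps`. It is a sketch with a made-up row: the `row` dict is hypothetical, and whether the driver hands back `uuid.UUID` objects or plain strings for the id column depends on how it is configured, so the UUID case is shown only as an assumption.

```python
import datetime
import json
import uuid


def default(o):
    # Same fallback as in the dump script: ISO 8601 for dates, str() otherwise.
    if hasattr(o, "isoformat"):
        return o.isoformat()
    return str(o)


# Hypothetical row shaped like a "Notes" record.
row = {
    "id": uuid.UUID("5197d13d-8145-4257-b7a4-996010c7512c"),
    "title": "Shared Test",
    "lastchangeAt": datetime.datetime(
        2023, 5, 15, 7, 58, 42, tzinfo=datetime.timezone.utc
    ),
}

encoded = json.dumps(row, default=default)
decoded = json.loads(encoded)
print(decoded["id"], decoded["lastchangeAt"])
```

Dates come back as ISO 8601 strings, which is why the indexing step later parses them with `datetime.datetime.fromisoformat`. Passing `ensure_ascii=False` to `json.dump` would additionally keep Japanese titles human-readable in the dump file.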
Code that sends the data to Meilisearch
codimd_data = load_data_from_json("./codimd.dump.json")
doc = codimd_data[0]
# parse the ISO-format createdAt string with fromisoformat and replace it with a Unix timestamp
import datetime
created_at = datetime.datetime.fromisoformat(doc["createdAt"])
doc["createdAt"] = created_at.timestamp()
updated_at = datetime.datetime.fromisoformat(doc["updatedAt"])
doc["updatedAt"] = updated_at.timestamp()
lastchange_at = datetime.datetime.fromisoformat(doc["lastchangeAt"])
doc["lastchangeAt"] = lastchange_at.timestamp()
doc
import meilisearch
import json
client = meilisearch.Client('https://mir.labs.basisconsulting.co.jp:7701', "1f95f3ef46f7aa3ae543b9d2544ffbc719b4befc52c2c0a6371893351f906bca")
codimd_data = load_data_from_json("./codimd.dump.json")
# add url to codimd page
for doc in codimd_data:
    doc["url"] = "https://mir.labs.basisconsulting.co.jp:3301/" + doc["shortid"]
    # replace \n to space in content
    doc["content"] = doc["content"].replace("\n", " ")
    doc["content"] = doc["content"].replace("\r", " ")
    doc["content"] = doc["content"].replace("\t", " ")
    # replace created_at to timestamp
    created_at = datetime.datetime.fromisoformat(doc["createdAt"])
    doc["createdAt"] = created_at.timestamp()
    updated_at = datetime.datetime.fromisoformat(doc["updatedAt"])
    doc["updatedAt"] = updated_at.timestamp()
    if doc["lastchangeAt"] is not None:
        lastchange_at = datetime.datetime.fromisoformat(doc["lastchangeAt"])
        doc["lastchangeAt"] = lastchange_at.timestamp()
client.index('md2').add_documents(codimd_data,"id")
How Meilisearch search works internally
Indexing
It uses LMDB.
Creating a database from scratch and managing it is hard work. It would make no sense to try and reinvent the wheel, so Meilisearch uses a storage engine under the hood. This allows the Meilisearch team to focus on improving search relevancy and search performance while abstracting away the complicated task of creating, reading and updating documents on disk and in memory.
Our storage engine is called Lightning Memory-Mapped Database (LMDB for short). LMDB is a transactional key-value store written in C that was developed for OpenLDAP and has ACID properties. Though we considered other options, such as Sled and RocksDB, we chose LMDB because it provided us with the best combination of performance, stability, and features.
https://www.meilisearch.com/docs/learn/advanced/storage#lmdb
Tokenizer
It appears to use Lindera + IPAdic.
A specialized Japanese pipeline using Lindera
https://www.meilisearch.com/docs/learn/advanced/tokenization
- There does not seem to be an n-gram tokenizer
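For comparison, this is what character n-gram tokenization would produce, as opposed to the dictionary-based segmentation that Lindera performs. It is a plain illustrative sketch and not Meilisearch code; `char_ngrams` is a hypothetical helper name.

```python
def char_ngrams(text, n=2):
    """Return the overlapping character n-grams of text."""
    if n <= 0:
        raise ValueError("n must be positive")
    return [text[i:i + n] for i in range(len(text) - n + 1)]


print(char_ngrams("全文検索"))
# ['全文', '文検', '検索']
```

An n-gram index matches any substring without needing a dictionary, at the cost of a larger index and noisier matches, which is why its absence matters for Japanese text.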
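The future of Japanese support
As Misskey adopts Meilisearch in earnest, there is a good chance that Japanese-language handling will get serious attention along the way.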
- https://github.com/meilisearch/product/discussions/532
- It also seems worth keeping an eye on the "V" person at Shiguredo (時雨堂) and on miiton
Other considerations
Where is the markdown content stored?
+------------------+
| Tables_in_hackmd |
+------------------+
| Authors |
| Notes |
| Revisions |
| SequelizeMeta |
| Sessions |
| Users |
+------------------+
It appears to be stored in the Notes table.
use hackmd;
SELECT title,shortid,alias FROM Notes;
hackmd-cli
> hackmd-cli history
ID Name
────────────────────── ──────────────────────────────────────────────────
UZfRPYFFQle3pJlgEMdRLA Shared Test
LG7ZTZseQxyuHjZgJB1kNQ Knowleage BaseのIDEA Note
m2fR63o9Shekciyt43g-Kg CodiMDを検索しよう
features Features
8qIqmLVaTTyfkjtfeMThWw 2023-05-11 SIPミーティング@NEE
DWS2aNj7Sh-bg7XidM947g 2023-05-15 llama-index trial memo
jOfPL087Rx-a4frCljNsMA 2023-05-16 メトロAI MTG
TscSxBE6RTCqRLNrFbbREQ SIPスマートインフラ 戦略及び計画(PDF:1407KB)PDF
_42rVPruQJyRE_aPtEHLOQ R&D misc
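After creating a dedicated account for hackmd-cli, running hackmd-cli history under that account returns an empty list:
ID Name
── ────
That is a tough spot. CodiMD does not support hackmd-cli list either, so how am I supposed to get the URLs of all notes...?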
Peeking into the DB
Schema of Notes
codimd=# \d "Notes"
Table "public.Notes"
Column | Type | Collation | Nullable | Default
------------------+--------------------------+-----------+----------+---------------------------------
id | uuid | | not null |
ownerId | uuid | | |
content | text | | |
title | text | | |
createdAt | timestamp with time zone | | |
updatedAt | timestamp with time zone | | |
shortid | character varying(255) | | not null | '0000000000'::character varying
permission | "enum_Notes_permission" | | |
viewcount | integer | | | 0
lastchangeuserId | uuid | | |
lastchangeAt | timestamp with time zone | | |
alias | character varying(255) | | |
savedAt | timestamp with time zone | | |
authorship | text | | |
deletedAt | timestamp with time zone | | |
Indexes:
"Notes_pkey" PRIMARY KEY, btree (id)
"notes_alias" btree (alias)
"notes_shortid" btree (shortid)
SELECTing only the columns that look useful
codimd=# SELECT "shortid", "title", "permission", "lastchangeAt" FROM "Notes";
shortid | title | permission | lastchangeAt
-----------+----------------------------------------------------+------------+----------------------------
_A19KHYoN | test_page_2 | private | 2023-05-08 07:32:14.243+00
Qtw_cm7UW | Shared Test | limited | 2023-05-15 07:58:42.236+00
3Tfsa4vhL | Features | locked | 2022-08-06 16:07:30+00
iBDlPwhAI | Release Notes | locked | 2022-08-06 16:07:30+00
ED9_X46ph | Untitled | private |
CPQR0ADM9 | Knowleage BaseのIDEA Note | limited | 2023-05-09 04:36:39.612+00
LRVy1Eq51 | test | private | 2023-05-08 06:55:46.991+00
dI53VLZ0y | Untitled | private |
KgPVJ8-8s | CodiMDを検索しよう | limited | 2023-05-18 06:35:52.009+00
HkALXv762 | 2023-05-11 SIPミーティング@NEE | private | 2023-05-15 01:45:18.88+00
DOfMgd-uL | Untitled | private |
_3LAOrc2k | 2023-05-15 llama-index trial memo | limited | 2023-05-16 04:31:18.815+00
xsf-jNPA4 | R&D misc | limited | 2023-05-18 06:03:25.147+00
n3y7LK2UB | Untitled | private |
xqsPz4bKu | WhisperAppDev | limited | 2023-05-18 06:02:59.128+00
EemmfphMF | 2023-05-16 メトロAI MTG | limited | 2023-05-17 04:01:14.208+00
MdR4jT_J4 | SIPスマートインフラ 戦略及び計画(PDF:1407KB)PDF | private | 2023-05-16 01:26:37.676+00
(17 rows)
codimd=# SELECT "id", "title", "permission", "lastchangeAt" FROM "Notes";
id | title | permission | lastchangeAt
--------------------------------------+----------------------------------------------------+------------+----------------------------
556bd1d3-d02b-4af9-ab0a-c940f06feca1 | test_page_2 | private | 2023-05-08 07:32:14.243+00
5197d13d-8145-4257-b7a4-996010c7512c | Shared Test | limited | 2023-05-15 07:58:42.236+00
7696a195-84cc-47d4-9707-b7e6e55cb3ab | Features | locked | 2022-08-06 16:07:30+00
8d29fefa-db42-452d-9480-78b26e30b486 | Release Notes | locked | 2022-08-06 16:07:30+00
860415b5-31ac-4b5b-b3f3-1e7be25c9ddf | Untitled | private |
2c6ed94d-9b1e-431c-ae1e-3660241d6435 | Knowleage BaseのIDEA Note | limited | 2023-05-09 04:36:39.612+00
06773d6f-7b11-4667-b1cc-c0cf04ae09a9 | test | private | 2023-05-08 06:55:46.991+00
c8c621eb-fc23-4194-bf57-ba3b64f105dc | Untitled | private |
9b67d1eb-7a3d-4a17-a472-2cade3783e2a | CodiMDを検索しよう | limited | 2023-05-18 06:45:52.233+00
f2a22a98-b55a-4d3c-9f92-3b5f78c4e15b | 2023-05-11 SIPミーティング@NEE | private | 2023-05-15 01:45:18.88+00
4a261c81-de69-46b4-af50-ea9aa063eee3 | Untitled | private |
0d64b668-d8fb-4a1f-9b83-b5e274cf78ee | 2023-05-15 llama-index trial memo | limited | 2023-05-16 04:31:18.815+00
ff8dab54-faee-409c-9113-f68fb441cb39 | R&D misc | limited | 2023-05-18 06:52:28.391+00
b2667075-e7a2-43c4-8453-f77e730332ee | Untitled | private |
06067e1c-5d3f-4570-80c4-b220aa1b2edb | WhisperAppDev | limited | 2023-05-18 06:02:59.128+00
8ce7cf2f-4f3b-471f-9ae1-fac296336c30 | 2023-05-16 メトロAI MTG | limited | 2023-05-17 04:01:14.208+00
4ec712c4-113a-4530-aa44-b36b15b6d111 | SIPスマートインフラ 戦略及び計画(PDF:1407KB)PDF | private | 2023-05-16 01:26:37.676+00
(17 rows)
How to generate the URL from the id
Example: the Shared Test note
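import uuid
import base64

# Base64url-encode the 16 raw bytes of the note's UUID and decode to str.
encoded = base64.urlsafe_b64encode(uuid.UUID('5197d13d-8145-4257-b7a4-996010c7512c').bytes).decode("ascii")
# Drop the trailing "==" padding, leaving the 22-character ID.
note_id = encoded[0:22]
print(encoded)
# ==> UZfRPYFFQle3pJlgEMdRLA==
print(note_id)
# ==> UZfRPYFFQle3pJlgEMdRLA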
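The same conversion can be wrapped in a pair of helpers that go in both directions. This is a sketch: `uuid_to_note_id` and `note_id_to_uuid` are hypothetical names, and the mapping is inferred from matching the hackmd-cli history IDs against the id column shown earlier rather than taken from CodiMD's source.

```python
import base64
import uuid


def uuid_to_note_id(note_uuid):
    """Encode a UUID string as the 22-character base64url note ID."""
    raw = uuid.UUID(note_uuid).bytes
    return base64.urlsafe_b64encode(raw).decode("ascii").rstrip("=")


def note_id_to_uuid(note_id):
    """Decode a 22-character note ID back into a UUID string."""
    # Restore the "=" padding that was stripped from the encoded form.
    padded = note_id + "=" * (-len(note_id) % 4)
    return str(uuid.UUID(bytes=base64.urlsafe_b64decode(padded)))


print(uuid_to_note_id("5197d13d-8145-4257-b7a4-996010c7512c"))
# UZfRPYFFQle3pJlgEMdRLA
print(note_id_to_uuid("m2fR63o9Shekciyt43g-Kg"))
# 9b67d1eb-7a3d-4a17-a472-2cade3783e2a
```

Both printed values agree with the rows listed earlier in this note: the first is the Shared Test entry from hackmd-cli history, the second is the id of this note in the "Notes" table.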
Note: wouldn't simply accessing BaseURL + shortid get redirected to the right page?