Let's Search CodiMD

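Implementation plan

  • Get the data from PostgreSQL: query it for each note's URL and content text
    • In the "Notes" table, content holds the note body and shortid is the ID used to reference the note in URLs
  • Convert the text to JSON in some simple way and send it to Meilisearch
    • For now, crudely strip line breaks from content before sending
  • For the URL ('url'), send baseurl + shortid
  • id is passed through unchanged
  • Dates in datetime.isoformat form are converted to Unix timestamps before sending.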
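
The plan above can be sketched as a single conversion function. This is only a minimal sketch under stated assumptions, not the code used in the scripts below: `to_search_doc`, `DATE_FIELDS`, and the `BASE_URL` value are hypothetical names introduced for illustration, and the input is assumed to be one row of the JSON dump with ISO 8601 date strings.

```python
import datetime

# Hypothetical base URL; replace it with the address of your CodiMD instance.
BASE_URL = "https://codimd.example.com/"

# Date columns of "Notes" that the plan converts to timestamps.
DATE_FIELDS = ("createdAt", "updatedAt", "lastchangeAt")


def to_search_doc(row, base_url=BASE_URL):
    """Convert one dumped "Notes" row (a dict) into a search document."""
    doc = dict(row)  # copy so the caller's row is left untouched
    # The note URL is the base URL followed by shortid.
    doc["url"] = base_url + doc["shortid"]
    # Crudely flatten line breaks and tabs; content may be NULL in the DB.
    content = doc.get("content") or ""
    doc["content"] = content.replace("\r", " ").replace("\n", " ").replace("\t", " ")
    # ISO 8601 strings become Unix timestamps; missing dates stay None.
    for field in DATE_FIELDS:
        value = doc.get(field)
        if value is not None:
            doc[field] = datetime.datetime.fromisoformat(value).timestamp()
    return doc
```

Copying the row rather than mutating it keeps the dump reusable if the conversion needs to be rerun with a different base URL.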

Related scripts

Code that dumps the contents of the PostgreSQL "Notes" table as JSON

# connect to PostgreSQL and dump the "Notes" table to a JSON file
import psycopg2
from psycopg2.extras import DictCursor
import json


def default(o):
    if hasattr(o, "isoformat"):
        return o.isoformat()
    else:
        return str(o)

def dump_data():
    with psycopg2.connect("host=localhost port=5432 dbname=codimd user=codimd password=6lCd8ftriT") as db:
        with db.cursor(cursor_factory=DictCursor) as cur:
            cur.execute('SELECT * FROM "Notes"')
            results = cur.fetchall()
            dict_result = []
            for row in results:
                dict_result.append(dict(row))
            with open("./codimd.dump.json", "w", encoding="utf-8") as f:
                json.dump(dict_result, f, default=default, indent=4)


def load_data_from_json(file):
    with open(file, "r", encoding="utf-8") as f:
        data = json.load(f)
    return data

Code that sends the data to Meilisearch

codimd_data = load_data_from_json("./codimd.dump.json")
doc = codimd_data[0]
# parse the ISO 8601 createdAt string with fromisoformat and replace it with a Unix timestamp
import datetime
created_at = datetime.datetime.fromisoformat(doc["createdAt"])
doc["createdAt"] = created_at.timestamp()
updated_at = datetime.datetime.fromisoformat(doc["updatedAt"])
doc["updatedAt"] = updated_at.timestamp()
lastchange_at = datetime.datetime.fromisoformat(doc["lastchangeAt"])
doc["lastchangeAt"] = lastchange_at.timestamp()
doc

import datetime
import json

import meilisearch

client = meilisearch.Client('https://mir.labs.basisconsulting.co.jp:7701', "1f95f3ef46f7aa3ae543b9d2544ffbc719b4befc52c2c0a6371893351f906bca")
codimd_data = load_data_from_json("./codimd.dump.json")
# add url to codimd page
for doc in codimd_data:
    doc["url"] = "https://mir.labs.basisconsulting.co.jp:3301/" + doc["shortid"]
    # replace newlines, carriage returns, and tabs in content with spaces
    # (content is a nullable column, so treat NULL as an empty string)
    content = doc["content"] or ""
    doc["content"] = content.replace("\n", " ").replace("\r", " ").replace("\t", " ")
    # convert the ISO 8601 date strings to Unix timestamps
    created_at = datetime.datetime.fromisoformat(doc["createdAt"])
    doc["createdAt"] = created_at.timestamp()
    updated_at = datetime.datetime.fromisoformat(doc["updatedAt"])
    doc["updatedAt"] = updated_at.timestamp()
    if doc["lastchangeAt"] is not None:
        lastchange_at = datetime.datetime.fromisoformat(doc["lastchangeAt"])
        doc["lastchangeAt"] = lastchange_at.timestamp()

client.index('md2').add_documents(codimd_data,"id")

About Meilisearch's search internals

Indexing

It uses LMDB.

Creating a database from scratch and managing it is hard work. It would make no sense to try and reinvent the wheel, so Meilisearch uses a storage engine under the hood. This allows the Meilisearch team to focus on improving search relevancy and search performance while abstracting away the complicated task of creating, reading and updating documents on disk and in memory.

Our storage engine is called Lightning Memory-Mapped Database (LMDB for short). LMDB is a transactional key-value store written in C that was developed for OpenLDAP and has ACID properties. Though we considered other options, such as Sled and RocksDB, we chose LMDB because it provided us with the best combination of performance, stability, and features.

https://www.meilisearch.com/docs/learn/advanced/storage#lmdb

Tokenizer

It appears to use Lindera + IPAdic.

A specialized Japanese pipeline using Lindera

https://www.meilisearch.com/docs/learn/advanced/tokenization

  • There does not appear to be any n-gram tokenization
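
For comparison, n-gram tokenization splits text into overlapping fixed-length character windows instead of dictionary words. The sketch below only illustrates that idea in plain Python; `char_ngrams` is a hypothetical helper and is unrelated to how Meilisearch or Lindera are actually implemented.

```python
def char_ngrams(text, n=2):
    """Return the overlapping character n-grams of text."""
    if n <= 0:
        raise ValueError("n must be positive")
    # Slide a window of n characters across the text, one character at a time.
    return [text[i:i + n] for i in range(len(text) - n + 1)]


print(char_ngrams("検索しよう"))
# → ['検索', '索し', 'しよ', 'よう']
```

With bigrams, a query such as 索し would match this title even though it is not a dictionary word, which is the kind of partial match a morphological tokenizer can miss.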

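Future of Japanese support

As Misskey adopts Meilisearch in earnest, Japanese-language handling may well get a number of fixes along the way.

Other considerations

Where is the Markdown content stored?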

+------------------+
| Tables_in_hackmd |
+------------------+
| Authors          |
| Notes            |
| Revisions        |
| SequelizeMeta    |
| Sessions         |
| Users            |
+------------------+

It appears to be stored in the Notes table.

use hackmd;
SELECT title,shortid,alias FROM Notes;

hackmd-cli

> hackmd-cli history

 ID                     Name
 ────────────────────── ──────────────────────────────────────────────────
 UZfRPYFFQle3pJlgEMdRLA Shared Test
 LG7ZTZseQxyuHjZgJB1kNQ Knowleage BaseのIDEA Note
 m2fR63o9Shekciyt43g-Kg CodiMDを検索しよう
 features               Features
 8qIqmLVaTTyfkjtfeMThWw 2023-05-11 SIPミーティング@NEE
 DWS2aNj7Sh-bg7XidM947g 2023-05-15 llama-index trial memo
 jOfPL087Rx-a4frCljNsMA 2023-05-16 メトロAI MTG
 TscSxBE6RTCqRLNrFbbREQ SIPスマートインフラ 戦略及び計画(PDF:1407KB)PDF
 _42rVPruQJyRE_aPtEHLOQ R&D misc

Creating a dedicated account for hackmd-cli and running hackmd-cli history under that account gives:

 ID Name
 ── ────

That is, an empty history, which is rough. CodiMD does not support hackmd-cli list either, so how am I supposed to fetch all the note URLs...

Peeking into the DB

Schema of Notes

codimd=# \d "Notes"
                                         Table "public.Notes"
      Column      |           Type           | Collation | Nullable |             Default
------------------+--------------------------+-----------+----------+---------------------------------
 id               | uuid                     |           | not null |
 ownerId          | uuid                     |           |          |
 content          | text                     |           |          |
 title            | text                     |           |          |
 createdAt        | timestamp with time zone |           |          |
 updatedAt        | timestamp with time zone |           |          |
 shortid          | character varying(255)   |           | not null | '0000000000'::character varying
 permission       | "enum_Notes_permission"  |           |          |
 viewcount        | integer                  |           |          | 0
 lastchangeuserId | uuid                     |           |          |
 lastchangeAt     | timestamp with time zone |           |          |
 alias            | character varying(255)   |           |          |
 savedAt          | timestamp with time zone |           |          |
 authorship       | text                     |           |          |
 deletedAt        | timestamp with time zone |           |          |
Indexes:
    "Notes_pkey" PRIMARY KEY, btree (id)
    "notes_alias" btree (alias)
    "notes_shortid" btree (shortid)

SELECTing only the columns that look useful

codimd=# SELECT "shortid", "title", "permission", "lastchangeAt" FROM "Notes";
  shortid  |                       title                        | permission |        lastchangeAt
-----------+----------------------------------------------------+------------+----------------------------
 _A19KHYoN | test_page_2                                        | private    | 2023-05-08 07:32:14.243+00
 Qtw_cm7UW | Shared Test                                        | limited    | 2023-05-15 07:58:42.236+00
 3Tfsa4vhL | Features                                           | locked     | 2022-08-06 16:07:30+00
 iBDlPwhAI | Release Notes                                      | locked     | 2022-08-06 16:07:30+00
 ED9_X46ph | Untitled                                           | private    |
 CPQR0ADM9 | Knowleage BaseのIDEA Note                          | limited    | 2023-05-09 04:36:39.612+00
 LRVy1Eq51 | test                                               | private    | 2023-05-08 06:55:46.991+00
 dI53VLZ0y | Untitled                                           | private    |
 KgPVJ8-8s | CodiMDを検索しよう                                 | limited    | 2023-05-18 06:35:52.009+00
 HkALXv762 | 2023-05-11 SIPミーティングNEE                   | private    | 2023-05-15 01:45:18.88+00
 DOfMgd-uL | Untitled                                           | private    |
 _3LAOrc2k | 2023-05-15 llama-index trial memo                  | limited    | 2023-05-16 04:31:18.815+00
 xsf-jNPA4 | R&D misc                                           | limited    | 2023-05-18 06:03:25.147+00
 n3y7LK2UB | Untitled                                           | private    |
 xqsPz4bKu | WhisperAppDev                                      | limited    | 2023-05-18 06:02:59.128+00
 EemmfphMF | 2023-05-16 メトロAI MTG                            | limited    | 2023-05-17 04:01:14.208+00
 MdR4jT_J4 | SIPスマートインフラ 戦略及び計画(PDF1407KBPDF | private    | 2023-05-16 01:26:37.676+00
(17 rows)
codimd=# SELECT "id", "title", "permission", "lastchangeAt" FROM "Notes";
                  id                  |                       title                        | permission |        lastchangeAt
--------------------------------------+----------------------------------------------------+------------+----------------------------
 556bd1d3-d02b-4af9-ab0a-c940f06feca1 | test_page_2                                        | private    | 2023-05-08 07:32:14.243+00
 5197d13d-8145-4257-b7a4-996010c7512c | Shared Test                                        | limited    | 2023-05-15 07:58:42.236+00
 7696a195-84cc-47d4-9707-b7e6e55cb3ab | Features                                           | locked     | 2022-08-06 16:07:30+00
 8d29fefa-db42-452d-9480-78b26e30b486 | Release Notes                                      | locked     | 2022-08-06 16:07:30+00
 860415b5-31ac-4b5b-b3f3-1e7be25c9ddf | Untitled                                           | private    |
 2c6ed94d-9b1e-431c-ae1e-3660241d6435 | Knowleage BaseのIDEA Note                          | limited    | 2023-05-09 04:36:39.612+00
 06773d6f-7b11-4667-b1cc-c0cf04ae09a9 | test                                               | private    | 2023-05-08 06:55:46.991+00
 c8c621eb-fc23-4194-bf57-ba3b64f105dc | Untitled                                           | private    |
 9b67d1eb-7a3d-4a17-a472-2cade3783e2a | CodiMDを検索しよう                                 | limited    | 2023-05-18 06:45:52.233+00
 f2a22a98-b55a-4d3c-9f92-3b5f78c4e15b | 2023-05-11 SIPミーティングNEE                   | private    | 2023-05-15 01:45:18.88+00
 4a261c81-de69-46b4-af50-ea9aa063eee3 | Untitled                                           | private    |
 0d64b668-d8fb-4a1f-9b83-b5e274cf78ee | 2023-05-15 llama-index trial memo                  | limited    | 2023-05-16 04:31:18.815+00
 ff8dab54-faee-409c-9113-f68fb441cb39 | R&D misc                                           | limited    | 2023-05-18 06:52:28.391+00
 b2667075-e7a2-43c4-8453-f77e730332ee | Untitled                                           | private    |
 06067e1c-5d3f-4570-80c4-b220aa1b2edb | WhisperAppDev                                      | limited    | 2023-05-18 06:02:59.128+00
 8ce7cf2f-4f3b-471f-9ae1-fac296336c30 | 2023-05-16 メトロAI MTG                            | limited    | 2023-05-17 04:01:14.208+00
 4ec712c4-113a-4530-aa44-b36b15b6d111 | SIPスマートインフラ 戦略及び計画(PDF1407KBPDF | private    | 2023-05-16 01:26:37.676+00
(17 rows)

How to generate the URL from the id

Example: the Shared Test note

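import uuid
import base64

# URL-safe base64 of the 16 UUID bytes gives 24 characters ending in "==".
encoded = base64.urlsafe_b64encode(uuid.UUID('5197d13d-8145-4257-b7a4-996010c7512c').bytes).decode("ascii")
# The first 22 characters (padding dropped) match the ID shown by hackmd-cli history.
note_id = encoded[0:22]
print(encoded)
# ==> UZfRPYFFQle3pJlgEMdRLA==
print(note_id)
# ==> UZfRPYFFQle3pJlgEMdRLA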
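
The same mapping can be wrapped in a pair of helpers that convert in both directions. This is a minimal sketch based only on the observation above that the 22-character ID is the URL-safe base64 of the UUID bytes without padding; `uuid_to_note_id` and `note_id_to_uuid` are hypothetical names, not part of CodiMD or hackmd-cli.

```python
import base64
import uuid


def uuid_to_note_id(note_uuid):
    """Encode a UUID string as the 22-character URL-safe base64 ID."""
    raw = uuid.UUID(note_uuid).bytes
    return base64.urlsafe_b64encode(raw).decode("ascii").rstrip("=")


def note_id_to_uuid(note_id):
    """Decode a 22-character ID back to the UUID string."""
    raw = base64.urlsafe_b64decode(note_id + "==")  # restore the stripped padding
    return str(uuid.UUID(bytes=raw))


print(uuid_to_note_id("5197d13d-8145-4257-b7a4-996010c7512c"))
# → UZfRPYFFQle3pJlgEMdRLA
print(note_id_to_uuid("UZfRPYFFQle3pJlgEMdRLA"))
# → 5197d13d-8145-4257-b7a4-996010c7512c
```

The reverse direction is handy for looking up the "Notes" row that corresponds to an ID copied from a note URL or from hackmd-cli history.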

Note: wouldn't accessing BaseURL + shortid simply get redirected to the note?