---
title: Podcast Search
emoji: 🚀
colorFrom: green
colorTo: gray
sdk: streamlit
sdk_version: 1.41.1
app_file: src/app.py
pinned: false
license: mit
short_description: Search the terapyon channel
---

# podcast-search

A tool for searching the terapyon channel podcast.

## Usage

### Title list

- Place the following file in the `store` folder:
  - `title-list-202301-202501.parquet`
- The file has the following columns:
  - id: int
  - date: str (2023-01-09)
  - length: int
  - audio: str (audio file URL)
  - title: str

Example of a title list file:

|   | id | date | length | audio | title |
|---|----|------|--------|-------|-------|
| 0 | 69 | 2023-01-09 | 20993616 | https://anchor.fm/s/14480e04/podcast/play/6323... | #69 2023年新年挨拶から 2022年の振り返りと2023年の抱負 |
| 1 | 70 | 2023-03-09 | 103287296 | https://anchor.fm/s/14480e04/podcast/play/6621... | #70 PyCon JP Association代表理事退任と今後の展望をIqbalさんと語る |
| 2 | 71 | 2023-03-22 | 116393694 | https://anchor.fm/s/14480e04/podcast/play/6706... | #71 hirokikyさんをゲストに自然言語処理系AI Chat GPT / Whisp... |
| 3 | 72 | 2023-05-04 | 49642320 | https://anchor.fm/s/14480e04/podcast/play/6976... | #72 PyCon US 2023 ひとり振り返り |
| 4 | 73 | 2023-05-24 | 150643013 | https://anchor.fm/s/14480e04/podcast/play/7094... | #73 Nyohoさんをゲストに Scratchからディープラーニングや数学の話 |
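
Before saving a title list, it can help to check that each row matches the columns listed above. The following is a minimal sketch, not part of this repository: `TITLE_LIST_SCHEMA`, `validate_title_rows`, and `write_title_list` are hypothetical names, the audio URL is a placeholder, and writing the file assumes that pandas and a parquet engine such as pyarrow are installed.

```python
from datetime import datetime

# Column names and types as listed in this README.
TITLE_LIST_SCHEMA = {"id": int, "date": str, "length": int, "audio": str, "title": str}


def validate_title_rows(rows):
    """Return a list of problems; the list is empty when every row fits the schema."""
    problems = []
    for i, row in enumerate(rows):
        if set(row) != set(TITLE_LIST_SCHEMA):
            problems.append(f"row {i}: columns {sorted(row)} do not match the schema")
            continue
        for name, expected in TITLE_LIST_SCHEMA.items():
            if type(row[name]) is not expected:
                problems.append(f"row {i}: {name} should be {expected.__name__}")
        if type(row["date"]) is str:
            try:
                # Dates are written year-month-day, e.g. 2023-01-09.
                datetime.strptime(row["date"], "%Y-%m-%d")
            except ValueError:
                problems.append(f"row {i}: date {row['date']!r} is not year-month-day")
    return problems


def write_title_list(rows, path="store/title-list-202301-202501.parquet"):
    """Write validated rows to parquet (assumes pandas plus a parquet engine)."""
    import pandas as pd  # imported lazily so validation works without pandas

    problems = validate_title_rows(rows)
    if problems:
        raise ValueError("; ".join(problems))
    pd.DataFrame(rows).to_parquet(path, index=False)


# Example row; the audio URL is a placeholder, not a real episode URL.
example_rows = [
    {
        "id": 72,
        "date": "2023-05-04",
        "length": 49642320,
        "audio": "https://example.com/audio/72.mp3",
        "title": "#72 PyCon US 2023 ひとり振り返り",
    }
]
print(validate_title_rows(example_rows))
```

Calling `write_title_list(example_rows)` would then create the file in the `store` folder, provided that folder already exists.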

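### Creating the transcript data

- Create a `data` folder (at the same level as `src`)
- Put the srt files in the `data` folder
  - (Following the rules below lets the ID be obtained from each srt file name)
  - The extension must be `.srt`
  - The file name must contain exactly one ID (an integer)
  - The ID must be delimited by `-` or `_` before and after it
- Run the following script. It creates one `parquet` file per srt file in the `store` folder

```
% python src/episode.py
```

### Creating the database

The following command creates the tables and loads the three required data sets into a persistent DuckDB database.

```
% python src/store.py all
```

Details of the command above:

- Create tables (create table)
  - `python src/store.py create`
- Insert the title list
  - `python src/store.py podcastinsert`
- Insert episodes and text
  - `python src/store.py episodeinsert`
- Vectorize (embedding)
  - `python src/store.py embed`
- Index the vector data
  - `python src/store.py index`

### Search UI

```
% streamlit run src/app.py
```

- Select one or more podcast titles. If none are selected, all titles are searched
- Enter the words to search for in the text box
- Ten candidate sentences are shown
- Click the left side of the table to display the text below it
- The audio timing (minutes and seconds) is shown
- The audio at that timing can be played on the spot

https://github.com/user-attachments/assets/98e85be4-a633-4bdc-900d-9a7c06818d9b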
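
As a supplement to the srt file naming rules in the transcript data section, the sketch below shows one way such an ID could be read from a file name. It is an illustration only and not necessarily how `src/episode.py` does it: `extract_episode_id` is a hypothetical helper, the file names are made up, and treating the start or end of the file name as a valid boundary for the ID is an assumption.

```python
import re
from pathlib import Path


def extract_episode_id(filename: str) -> int:
    """Return the single integer ID found in an srt file name.

    Assumption: the name (without extension) is split on '-' and '_',
    and exactly one of the resulting parts must consist only of digits.
    """
    path = Path(filename)
    if path.suffix != ".srt":
        raise ValueError(f"{filename}: extension must be .srt")
    parts = re.split(r"[-_]", path.stem)
    ids = [part for part in parts if re.fullmatch(r"[0-9]+", part)]
    if len(ids) != 1:
        raise ValueError(f"{filename}: expected exactly one integer ID, found {len(ids)}")
    return int(ids[0])


# Hypothetical file names for illustration.
print(extract_episode_id("terapyon-69-newyear.srt"))
print(extract_episode_id("data/episode_70.srt"))
```

Names with zero or several integer parts raise `ValueError`, which matches the rule that the file name contains exactly one ID.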