
Commit d0041a9

Optimize the file storage structure of the knowledge base (#386)
1 parent 29d1527 commit d0041a9

41 files changed: +591 -231 lines changed

.vscode/settings.json

Lines changed: 1 addition & 1 deletion
@@ -2,7 +2,7 @@
   "editor.formatOnSave": true,
   "editor.mouseWheelZoom": true,
   "typescript.tsdk": "node_modules/typescript/lib",
-  "editor.defaultFormatter": "esbenp.prettier-vscode",
+  "prettier.prettierPath": "./node_modules/prettier",
   "i18n-ally.localesPaths": [
     "projects/app/public/locales"
   ],

docSite/README.md

Lines changed: 1 addition & 1 deletion
@@ -3,7 +3,7 @@
 ## Running locally
 
 1. Install the Go language environment.
-2. Install hugo. [Binary download](https://github.com/gohugoio/hugo/releases/tag/v0.117.0)
+2. Install hugo. [Binary download](https://github.com/gohugoio/hugo/releases/tag/v0.117.0). Note that the extended version is required.
 3. cd docSite
 4. hugo serve
 5. Visit http://localhost:1313

docSite/content/docs/development/configuration.md

Lines changed: 8 additions & 0 deletions
@@ -84,6 +84,14 @@ weight: 520
     "maxToken": 16000,
     "price": 0,
     "prompt": ""
+  },
+  "QGModel": { // model used to generate next-step guidance
+    "model": "gpt-3.5-turbo",
+    "name": "GPT35-4k",
+    "maxToken": 4000,
+    "price": 0,
+    "prompt": "",
+    "functionCall": false
   }
 }
 ```
Lines changed: 8 additions & 0 deletions
@@ -0,0 +1,8 @@
+---
+weight: 540
+title: "Design"
+description: "Selected FastGPT design notes"
+icon: public
+draft: false
+images: []
+---
Lines changed: 25 additions & 0 deletions
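@@ -0,0 +1,25 @@
+---
+weight: 541
+title: "Datasets"
+description: "Design of files and data in FastGPT datasets"
+icon: dataset
+draft: false
+images: []
+---
+
+## Relationship between files and data
+
+In FastGPT, files are stored in MongoDB's FS, while the actual data is stored in PostgreSQL. Each row in PG has a file_id column that links it to the corresponding file. For compatibility with older versions, and to cover manually entered and manually annotated data, file_id also accepts a few special values:
+
+- manual: manually entered data
+- mark: manually annotated data
+
+Note that file_id is written only when the data is inserted; it cannot be modified on later updates.
+
+## File import flow
+
+1. Upload the file to MongoDB's FS and obtain its file_id. At this point the file is marked as `unused`.
+2. The browser parses the file and produces the text and its chunks.
+3. Tag each chunk with the file_id.
+4. On clicking "upload data": change the file's status to `used` and push the data to the mongo `training` table, where it waits for training.
+5. The training thread takes the data from mongo, obtains the vectors, and then inserts the result into pg.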
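
The first four steps of the import flow described above can be sketched as a small in-memory model. This is only an illustration under stated assumptions: `InMemoryImport`, `StoredFile`, `Chunk`, and the fixed-size chunking are hypothetical names and choices, not FastGPT's code, and plain arrays and maps stand in for MongoDB and PostgreSQL. Step 5 (computing vectors and inserting into pg) is left out.

```typescript
// Hypothetical in-memory model of the import flow; not FastGPT's implementation.
type FileStatus = 'unused' | 'used';

interface StoredFile {
  id: string;
  name: string;
  status: FileStatus;
}

interface Chunk {
  file_id: string; // set once at creation, mirroring "written only on insert"
  text: string;
}

class InMemoryImport {
  private files = new Map<string, StoredFile>();
  private nextId = 1;
  // Stands in for the mongo `training` table.
  readonly trainingQueue: Chunk[] = [];

  // Step 1: store the file and mark it `unused`.
  uploadFile(name: string): string {
    const id = `file-${this.nextId++}`;
    this.files.set(id, { id, name, status: 'unused' });
    return id;
  }

  // Steps 2-3: split the text (fixed-size here, an assumption) and tag each chunk.
  splitIntoChunks(fileId: string, text: string, size: number): Chunk[] {
    const chunks: Chunk[] = [];
    for (let i = 0; i < text.length; i += size) {
      chunks.push({ file_id: fileId, text: text.slice(i, i + size) });
    }
    return chunks;
  }

  // Step 4: mark the file `used` and queue its chunks for training.
  pushData(fileId: string, chunks: Chunk[]): void {
    const file = this.files.get(fileId);
    if (!file) throw new Error(`unknown file_id: ${fileId}`);
    file.status = 'used';
    this.trainingQueue.push(...chunks);
  }

  statusOf(fileId: string): FileStatus | undefined {
    return this.files.get(fileId)?.status;
  }
}
```

Keeping the `unused` state separate from `used` means a file that was uploaded but never confirmed can be told apart from one whose chunks were actually queued.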
Lines changed: 29 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,29 @@
1+
---
2+
title: 'V4.4.7'
3+
description: 'FastGPT V4.4.7 更新(需执行升级脚本)'
4+
icon: 'upgrade'
5+
draft: false
6+
toc: true
7+
weight: 840
8+
---
9+
10+
## 执行初始化 API
11+
12+
发起 1 个 HTTP 请求({{rootkey}} 替换成环境变量里的`rootkey`,{{host}}替换成自己域名)
13+
14+
1. https://xxxxx/api/admin/initv445
15+
16+
```bash
17+
curl --location --request POST 'https://{{host}}/api/admin/initv447' \
18+
--header 'rootkey: {{rootkey}}' \
19+
--header 'Content-Type: application/json'
20+
```
21+
22+
初始化 pg 索引以及将 file_id 中空对象转成 manual 对象。如果数据多,可能需要较长时间,可以通过日志查看进度。
23+
24+
## 功能介绍
25+
26+
### Fast GPT V4.4.7
27+
28+
1. 优化了数据库文件 crud。
29+
2. 兼容链接读取,作为 source。
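
For readers who script the upgrade rather than running curl by hand, the request above could be assembled as follows. This is a sketch only: `buildInitRequest` and `InitRequest` are hypothetical names, and the endpoint path and header names are taken from the curl command in the document.

```typescript
// Hypothetical helper that mirrors the curl command; not part of FastGPT.
interface InitRequest {
  url: string;
  method: 'POST';
  headers: Record<string, string>;
}

function buildInitRequest(host: string, rootkey: string): InitRequest {
  return {
    url: `https://${host}/api/admin/initv447`,
    method: 'POST',
    headers: {
      rootkey, // the `rootkey` value from the deployment's environment variables
      'Content-Type': 'application/json'
    }
  };
}

// Sending it (assumes a runtime with a global fetch, such as Node 18+):
// const req = buildInitRequest('your.domain', process.env.ROOT_KEY ?? '');
// await fetch(req.url, { method: req.method, headers: req.headers });
```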

packages/common/tools/file.ts

Lines changed: 23 additions & 0 deletions
@@ -0,0 +1,23 @@
+import { strIsLink } from './str';
+
+export const fileImgs = [
+  { suffix: 'pdf', src: '/imgs/files/pdf.svg' },
+  { suffix: 'csv', src: '/imgs/files/csv.svg' },
+  { suffix: '(doc|docs)', src: '/imgs/files/doc.svg' },
+  { suffix: 'txt', src: '/imgs/files/txt.svg' },
+  { suffix: 'md', src: '/imgs/files/markdown.svg' },
+  { suffix: '.', src: '/imgs/files/file.svg' }
+];
+
+export function getFileIcon(name = '') {
+  return fileImgs.find((item) => new RegExp(item.suffix, 'gi').test(name))?.src;
+}
+export function getSpecialFileIcon(name = '') {
+  if (name === 'manual') {
+    return '/imgs/files/manual.svg';
+  } else if (name === 'mark') {
+    return '/imgs/files/mark.svg';
+  } else if (strIsLink(name)) {
+    return '/imgs/files/link.svg';
+  }
+}
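
One property of `getFileIcon` worth noting: each `suffix` is compiled as an unanchored pattern, so it matches anywhere in the name (`pdf-notes.txt` resolves to the PDF icon), and the final `'.'` entry matches any single character, which makes it the fallback for every non-empty name. A possible variant that matches only the real extension is sketched below. `getFileIconAnchored`, `fileIconBySuffix`, and `defaultFileIcon` are hypothetical names, the `docx?` pattern is an assumption about what `(doc|docs)` was meant to cover, and unlike the original this version also returns the default icon for an empty name.

```typescript
// Hypothetical variant of the icon lookup; not the code from this commit.
// Each pattern is anchored to the end of the name and has no `g` flag,
// so `test` carries no state between calls.
const fileIconBySuffix: { pattern: RegExp; src: string }[] = [
  { pattern: /\.pdf$/i, src: '/imgs/files/pdf.svg' },
  { pattern: /\.csv$/i, src: '/imgs/files/csv.svg' },
  { pattern: /\.docx?$/i, src: '/imgs/files/doc.svg' },
  { pattern: /\.txt$/i, src: '/imgs/files/txt.svg' },
  { pattern: /\.md$/i, src: '/imgs/files/markdown.svg' }
];

const defaultFileIcon = '/imgs/files/file.svg';

function getFileIconAnchored(name = ''): string {
  const match = fileIconBySuffix.find((item) => item.pattern.test(name));
  return match ? match.src : defaultFileIcon;
}
```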

packages/common/tools/str.ts

Lines changed: 5 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,5 @@
1+
export function strIsLink(str?: string) {
2+
if (!str) return false;
3+
if (/^((http|https)?:\/\/|www\.|\/)[^\s/$.?#].[^\s]*$/i.test(str)) return true;
4+
return false;
5+
}
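
To make the accepted inputs concrete, the snippet below repeats the same regular expression (copied from the diff so the example is self-contained, without the `export`) and runs it over a few sample strings. The sample values are illustrative. Note that a site-relative path starting with `/` is also treated as a link.

```typescript
// Same pattern as in packages/common/tools/str.ts, repeated for a standalone example.
function strIsLink(str?: string): boolean {
  if (!str) return false;
  return /^((http|https)?:\/\/|www\.|\/)[^\s/$.?#].[^\s]*$/i.test(str);
}

// Illustrative inputs: a full URL, a www host, a site-relative path,
// one of the special file_id values, and an empty string.
const linkSamples = ['https://example.com/a.pdf', 'www.example.com', '/imgs/files/pdf.svg', 'manual', ''];
const linkResults = linkSamples.map((sample) => strIsLink(sample));
// linkResults is [true, true, true, false, false]
```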

packages/core/dataset/constant.ts

Lines changed: 15 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,15 @@
1+
export enum DatasetSpecialIdEnum {
2+
manual = 'manual',
3+
mark = 'mark'
4+
}
5+
export const datasetSpecialIdMap = {
6+
[DatasetSpecialIdEnum.manual]: {
7+
name: 'kb.Manual Data',
8+
sourceName: 'kb.Manual Input'
9+
},
10+
[DatasetSpecialIdEnum.mark]: {
11+
name: 'kb.Mark Data',
12+
sourceName: 'kb.Manual Mark'
13+
}
14+
};
15+
export const datasetSpecialIds: string[] = [DatasetSpecialIdEnum.manual, DatasetSpecialIdEnum.mark];
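
A short sketch of how the map might be consulted when rendering a data row. The `kb.*` strings look like translation keys, but that is an assumption, and `getSpecialSourceNameKey` is a hypothetical helper that does not appear in the commit. The enum and map are repeated so the example stands alone.

```typescript
// Repeated from packages/core/dataset/constant.ts for a standalone example.
enum DatasetSpecialIdEnum {
  manual = 'manual',
  mark = 'mark'
}
const datasetSpecialIdMap = {
  [DatasetSpecialIdEnum.manual]: { name: 'kb.Manual Data', sourceName: 'kb.Manual Input' },
  [DatasetSpecialIdEnum.mark]: { name: 'kb.Mark Data', sourceName: 'kb.Manual Mark' }
};

// Hypothetical helper: returns the sourceName key for a special file_id,
// or undefined for anything else (a regular file id or a link).
function getSpecialSourceNameKey(fileId: string): string | undefined {
  // An own-property check avoids picking up inherited keys such as "toString".
  if (!Object.prototype.hasOwnProperty.call(datasetSpecialIdMap, fileId)) return undefined;
  return datasetSpecialIdMap[fileId as DatasetSpecialIdEnum].sourceName;
}
```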

packages/core/dataset/utils.ts

Lines changed: 8 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,8 @@
1+
import { datasetSpecialIds } from './constant';
2+
import { strIsLink } from '@fastgpt/common/tools/str';
3+
4+
export function isSpecialFileId(id: string) {
5+
if (datasetSpecialIds.includes(id)) return true;
6+
if (strIsLink(id)) return true;
7+
return false;
8+
}
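
Since `isSpecialFileId` returns true for `manual`, `mark`, and anything `strIsLink` accepts, one plausible use is separating ids that refer to a stored file from ids that do not. The sketch below inlines the two dependencies so it runs on its own. `partitionFileIds` is a hypothetical helper and the sample ids are made up.

```typescript
// Inlined copies of the helpers from this commit, so the example is standalone.
const datasetSpecialIds: string[] = ['manual', 'mark'];

function strIsLink(str?: string): boolean {
  if (!str) return false;
  return /^((http|https)?:\/\/|www\.|\/)[^\s/$.?#].[^\s]*$/i.test(str);
}

function isSpecialFileId(id: string): boolean {
  return datasetSpecialIds.includes(id) || strIsLink(id);
}

// Hypothetical helper: split ids into special values and ids of stored files.
function partitionFileIds(ids: string[]): { special: string[]; stored: string[] } {
  const special: string[] = [];
  const stored: string[] = [];
  for (const id of ids) {
    (isSpecialFileId(id) ? special : stored).push(id);
  }
  return { special, stored };
}
```

Only the `stored` ids would need a lookup in file storage; the `special` ones carry their meaning in the value itself.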
