Chinese Search (pg_jieba) - Cloud Database PostgreSQL Edition - Volcengine

Chinese Search (pg_jieba)

Last updated: 2024.12.12 15:21:45
First published: 2024.12.12 15:21:45

pg_jieba is a PostgreSQL extension that provides Chinese word segmentation for full-text search.

Precautions

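In addition to basic word segmentation, pg_jieba supports customized segmentation. Before using customized segmentation, add pg_jieba to the shared_preload_libraries parameter. For details on changing parameters, see Modify parameters.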

Note: Changing this parameter requires an instance restart to take effect. Plan the change and assess its business impact in advance.
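Before planning a restart, it can help to check whether the library is already preloaded. The following is a minimal sketch using the standard `SHOW` command; the value returned depends on what else is configured on the instance.

```sql
-- List the libraries preloaded at server start.
-- pg_jieba must appear in this list before the customization parameters can be used.
SHOW shared_preload_libraries;
```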

Create and drop the extension

Create the extension
```
create extension pg_jieba;
```
Drop the extension
```
drop extension pg_jieba;
```
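To confirm the state of the extension in the current database, the standard system catalogs can be queried. This is a sketch rather than a required step; it assumes only that the extension registers a text search configuration named `jiebacfg`, which is the name used in the examples in this document.

```sql
-- Create the extension only if it is not already installed in this database.
CREATE EXTENSION IF NOT EXISTS pg_jieba;

-- Confirm that it is installed and see which version is in use.
SELECT extname, extversion FROM pg_extension WHERE extname = 'pg_jieba';

-- Confirm that the jiebacfg text search configuration exists.
SELECT cfgname FROM pg_ts_config WHERE cfgname = 'jiebacfg';
```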

Use the extension

Basic usage

```
postgres=# select * from to_tsquery('jiebacfg', '是拖拉机学院手扶拖拉机专业的。不用多久，我就会升职加薪，当上CEO，走上人生巅峰。');
                                                            to_tsquery
----------------------------------------------------------------------------------------------------------------------------------
 '拖拉机' & '学院' & '手扶拖拉机' & '专业' & '不用' & '多久' & '会' & '升职' & '加薪' & '当上' & 'ceo' & '走上' & '人生' & '巅峰'
(1 row)

postgres=# select * from to_tsvector('jiebacfg', '是拖拉机学院手扶拖拉机专业的。不用多久，我就会升职加薪，当上CEO，走上人生巅峰。');
                                                                to_tsvector
--------------------------------------------------------------------------------------------------------------------------------------------
 'ceo':18 '不用':8 '专业':5 '人生':21 '会':13 '加薪':15 '升职':14 '多久':9 '学院':3 '巅峰':22 '当上':17 '手扶拖拉机':4 '拖拉机':2 '走上':20
(1 row)
```
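The two functions above are typically combined with the `@@` match operator to search a table. The following is a minimal sketch, assuming a hypothetical `articles` table and index name that do not come from this document; only `jiebacfg`, `to_tsvector`, and `to_tsquery` are taken from the examples above.

```sql
-- Hypothetical table used only for illustration.
CREATE TABLE articles (
    id   bigserial PRIMARY KEY,
    body text NOT NULL
);

INSERT INTO articles (body) VALUES
    ('是拖拉机学院手扶拖拉机专业的。'),
    ('不用多久，我就会升职加薪，当上CEO，走上人生巅峰。');

-- Expression index so that queries do not have to re-segment every row.
CREATE INDEX articles_body_jieba_idx
    ON articles USING gin (to_tsvector('jiebacfg', body));

-- Find rows containing both terms.
-- The expression in WHERE must match the indexed expression for the index to be usable.
SELECT id, body
FROM articles
WHERE to_tsvector('jiebacfg', body) @@ to_tsquery('jiebacfg', '升职 & 加薪');
```

On a table this small the planner will likely choose a sequential scan; the index matters once the table grows.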

```
-- Token And Tag
postgres=# select * from ts_token_type('jieba');
 tokid | alias |         description
-------+-------+-----------------------------
     1 | eng   | letter
     2 | nz    | other proper noun
     3 | n     | noun
     4 | m     | numeral
     5 | i     | idiom
     6 | l     | temporary idiom
     7 | d     | adverb
     8 | s     | space
     9 | t     | time
    10 | mq    | numeral-classifier compound
    11 | nr    | person's name
    12 | j     | abbreviate
    13 | a     | adjective
    14 | r     | pronoun
    15 | b     | difference
    16 | f     | direction noun
    17 | nrt   | nrt
    18 | v     | verb
    19 | z     | z
    20 | ns    | location
    21 | q     | quantity
    22 | vn    | vn
    23 | c     | conjunction
    24 | nt    | organization
    25 | u     | auxiliary
    26 | o     | onomatopoeia
    27 | zg    | zg
    28 | nrfg  | nrfg
    29 | df    | df
    30 | p     | prepositional
    31 | g     | morpheme
    32 | y     | modal verbs
    33 | ad    | ad
    34 | vg    | vg
    35 | ng    | ng
    36 | x     | unknown
    37 | ul    | ul
    38 | k     | k
    39 | ag    | ag
    40 | dg    | dg
    41 | rr    | rr
    42 | rg    | rg
    43 | an    | an
    44 | vq    | vq
    45 | e     | exclamation
    46 | uv    | uv
    47 | tg    | tg
    48 | mg    | mg
    49 | ud    | ud
    50 | vi    | vi
    51 | vd    | vd
    52 | uj    | uj
    53 | uz    | uz
    54 | h     | h
    55 | ug    | ug
    56 | rz    | rz
(56 rows)
```
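To see which of these token types a given string produces, PostgreSQL's built-in `ts_parse` and `ts_debug` functions can be pointed at the `jieba` parser and the `jiebacfg` configuration. This is a sketch for inspection only; the sample string is taken from the example sentence above, and the exact tokens returned depend on the dictionaries in use.

```sql
-- Show how the jieba parser splits a string.
-- The tokid column maps to the tokid values listed by ts_token_type('jieba').
SELECT * FROM ts_parse('jieba', '手扶拖拉机专业');

-- Show, for each token, its type alias and the lexemes produced by the configuration.
SELECT alias, token, lexemes
FROM ts_debug('jiebacfg', '手扶拖拉机专业');
```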

Advanced usage: customizing segmentation

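After the prerequisite is met (pg_jieba has been added to shared_preload_libraries and the instance has been restarted), the following parameters take effect. These configuration options let you customize segmentation to fit specific business needs.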

pg_jieba.hmm_model -- HMM model file.
pg_jieba.base_dict -- Base dictionary.
pg_jieba.user_dict -- Comma-separated list of user dictionary names (without the .dict suffix). All dictionaries must be located in the tsearch_data directory.
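The current values of these parameters can be read from a session. The following is a minimal sketch using the built-in `current_setting` function; it assumes nothing about the values themselves, which depend on how the instance is configured.

```sql
-- The second argument (missing_ok = true) makes current_setting return NULL
-- instead of raising an error when a parameter is not defined,
-- for example when pg_jieba has not been preloaded.
SELECT name, current_setting(name, true) AS value
FROM (VALUES
    ('pg_jieba.hmm_model'),
    ('pg_jieba.base_dict'),
    ('pg_jieba.user_dict')
) AS p(name);
```

A NULL in the value column suggests that the prerequisite described in the precautions has not yet been met.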

Other

For more information, see the official pg_jieba documentation.