Analyzing Raw User Feedback Data with Large Language Models
Background:
- An earlier tally of user sentiment feedback from the Taobao client covered about 13,000 items for January and February 2023, an average of 217 items per day.
- This feedback reflects real problems users hit while using the product, and the impact on the affected users tends to be severe.
Problem:
- No existing tool could regularly track and analyze the sentiment feedback for our business domain.
Approach:
- Use algorithms to process the feedback data, automatically categorize the issues, and attach reasonable metrics.
- For issues in different domains, set up a collaboration process with product teams, drive fixes, and accumulate reusable solutions.
- Track the metrics automatically over the long term and build an early-warning capability for the business.
Data
One more note on the data: in real scenarios the user feedback is quite sparse. A human can infer rough categories, but only with the help of business knowledge and context. A sample of the data looks like this:
怎么吃 (How do I take it?)
早上吃还是晚上吃 (Take it in the morning or in the evening?)
吃几个 (How many should I take?)
好的,晚上吃吗 (OK, should I take it at night?)
中午用不用吃 (Do I need to take it at noon?)
饭前还是饭后 (Before or after meals?)
睡觉前吃 (Take it before bed?)
一次吃几粒 (How many pills per dose?)
早上吃吗 (Take it in the morning?)
喝酒可以吃吗 (Can I take it if I drink alcohol?)
The business scenario is users asking questions while buying medicine on Taobao and chatting with a doctor. The main questions are clearly about dosage and administration. The difficulties are:
- Each feedback item is short, there is little data, and context is missing
- The relevant feedback is sparse: a few hundred items are scattered across tens of thousands of feedback entries
Clustering with an LLM
Use a prompt to let the LLM understand the semantics and cluster the data, for example the following prompt:
Below are user feedback questions collected while users were buying medicine. Please analyze the text in the context of this scenario, merge and classify all the data, and output the classification result in JSON format, e.g. {'Category 1': ['feedback 1', 'feedback 2']}
The data is as follows
```
饭前还是饭后
睡觉前吃
一次吃几粒
```
In the end GPT can indeed give an answer; I tested this earlier with gpt-3.5 and the results were quite good:
Based on the user feedback questions provided, we can classify them into the following categories:
用药时间 (dosing time): questions about when to take the medicine, such as before or after meals, or before bed.
用药剂量 (dosage): questions about how much to take, such as how many pills per dose.
Below is the merged classification in JSON format:
```json
{
  "用药时间": [
    "饭前还是饭后",
    "睡觉前吃"
  ],
  "用药剂量": [
    "一次吃几粒"
  ]
}
```
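For reference, here is a minimal sketch of how such a prompt could be sent programmatically. It assumes the openai Python SDK (v1+) with an OPENAI_API_KEY in the environment; the feedback list is just the three sample items from above, and the JSON parsing shows where malformed output has to be caught.
```python
import json
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Three sample items from above; in practice this would be a batch of real feedback
feedbacks = ["饭前还是饭后", "睡觉前吃", "一次吃几粒"]

prompt = (
    "Below are user feedback questions collected while users were buying medicine. "
    "Analyze the text in the context of this scenario, merge and classify all the data, "
    "and output the result in JSON format, e.g. {'Category 1': ['feedback 1', 'feedback 2']}\n"
    "The data is as follows:\n" + "\n".join(feedbacks)
)

resp = client.chat.completions.create(
    model="gpt-3.5-turbo",  # or any newer chat model
    messages=[{"role": "user", "content": prompt}],
    temperature=0,          # reduce run-to-run variation
)

raw = resp.choices[0].message.content
try:
    clusters = json.loads(raw)  # the model does not always return valid JSON
except json.JSONDecodeError:
    clusters = None             # in practice you would retry here
print(clusters)
```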
A quick test looks fine, but then the problems start:
- How do we process tens of thousands of items?
- The output is capped at 2048 tokens, so both input and output length have to be controlled
- The model does not always produce correct output; you have to rerun it several times (roughly a 3-in-10 failure rate)
- Running the same input multiple times gives inconsistent results
- The category names are unstable, e.g. duplicate categories such as "用药时间" and "何时用药" (both meaning "dosing time") show up
The long-input problem can be handled with the refine document chain we covered in the "LangChain之ReAct原理" lesson (a rough sketch follows the list below):
But new problems come with it:
- Simply splitting the text into chunks, classifying each chunk with the LLM, and merging the results leads to unstable categorization and duplicated, near-identical categories; the frequent calls also make the output less stable;
- With the refine document chain, the previous round's categories are carried into the next prompt, so the context keeps growing (the earlier gpt-3.5 could not exceed 4096 tokens), and the model's output degrades and becomes less and less accurate.
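To make the second point concrete, here is a rough, hand-rolled sketch of a refine-style loop. This is not LangChain's actual chain implementation; refine_classify and call_llm are hypothetical names. Each round feeds the running categories back into the next prompt, which is exactly why the context keeps growing with the number of chunks:
```python
import json
from openai import OpenAI

client = OpenAI()

def call_llm(prompt: str) -> str:
    """Single chat-completion call; placeholder for whatever LLM client you use."""
    resp = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return resp.choices[0].message.content

def refine_classify(feedbacks, chunk_size=50):
    """Refine-style loop: carry the running categories into every prompt (sketch only)."""
    categories = {}  # category name -> list of feedback items
    for start in range(0, len(feedbacks), chunk_size):
        chunk = feedbacks[start:start + chunk_size]
        prompt = (
            "Existing categories so far (JSON): "
            + json.dumps(categories, ensure_ascii=False)
            + "\nMerge the following new feedback items into these categories, adding new "
            "categories only when necessary, and return the full updated JSON:\n"
            + "\n".join(chunk)
        )
        # The prompt carries all previously classified items, so it grows with every
        # chunk and eventually approaches the model's context limit (4096 tokens on
        # the earlier gpt-3.5); json.loads will also fail whenever the output is not
        # valid JSON, which is where the instability shows up.
        categories = json.loads(call_llm(prompt))
    return categories
```
Even with careful prompting, this loop inherits all of the consistency issues listed above, which is what motivated the second approach below.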
Clustering with transformers
Using an LLM directly has many problems; I won't list them all here, and readers who are interested can experiment themselves. LLMs have also become much stronger since then, so you could try Kimi, gpt-4o, and similar models. Here we introduce a second approach: clustering directly with a transformer.
We know that LLMs are built on the transformer architecture, so can we borrow from the way they are trained and cluster the text data directly?
The answer is yes. After digging through documentation and surveying the options, I finally found Sentence-Transformers.
Here is a clustering example:
```
pip install -U sentence-transformers
pip install -U transformers
```
Sentence-Transformers provides a large number of pretrained models. For STS (Semantic Textual Similarity) tasks, some of the better ones are:
roberta-large-nli-stsb-mean-tokens - STSb performance: 86.39
roberta-base-nli-stsb-mean-tokens - STSb performance: 85.44
bert-large-nli-stsb-mean-tokens - STSb performance: 85.29
distilbert-base-nli-stsb-mean-tokens - STSb performance: 85.16
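As a quick sanity check of what these embedding models do, here is a minimal sketch that scores a few of the sample feedback sentences against each other. It assumes the multilingual model paraphrase-multilingual-MiniLM-L12-v2 pulled from the Hugging Face hub, since the English STSb models listed above are less suitable for Chinese text; the full clustering script follows.
```python
from sentence_transformers import SentenceTransformer, util

# Multilingual paraphrase model; downloaded from the Hugging Face hub on first use
model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")

sentences = ["饭前还是饭后", "睡觉前吃", "喝酒可以吃吗"]
embeddings = model.encode(sentences, convert_to_tensor=True)

# Pairwise cosine similarities; the two dosing-time questions should score
# higher with each other than with the alcohol question
print(util.cos_sim(embeddings, embeddings))
```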
"""
This is a more complex example on performing clustering on large scale dataset.
This examples find in a large set of sentences local communities, i.e., groups of sentences that are highly
similar. You can freely configure the threshold what is considered as similar. A high threshold will
only find extremely similar sentences, a lower threshold will find more sentence that are less similar.
A second parameter is 'min_community_size': Only communities with at least a certain number of sentences will be returned.
The method for finding the communities is extremely fast, for clustering 50k sentences it requires only 5 seconds (plus embedding comuptation).
In this example, we download a large set of questions from Quora and then find similar questions in this set.
"""
from sentence_transformers import SentenceTransformer, util
import os
import csv
import time
import pickle
# Model for computing sentence embeddings. We use a local copy of a model trained for similar-question detection
model_name = "/Users/chalecao/Downloads/model/paraphrase-multilingual-MiniLM-L12-v2"
model = SentenceTransformer(model_name)

# The upstream example downloads the Quora Duplicate Questions Dataset
# (https://www.quora.com/q/quoradata/First-Quora-Dataset-Release-Question-Pairs);
# here we read the cleaned user-feedback CSV instead and find similar questions in it
dataset_path = "userFeedBack-202307-clean.csv"
max_corpus_size = 100000  # We limit our corpus to the first 100k entries

embedding_cache_path = 'embeddings-cache-{}-size-{}.pkl'.format('userFeedBack', max_corpus_size)
# Check if the embedding cache already exists
if not os.path.exists(embedding_cache_path):
    # Get all unique sentences from the file
    corpus_sentences = set()
    # If a file read with utf-8 starts with a \ufeff BOM, reading it with utf-8-sig solves the problem
    with open(dataset_path, encoding='utf-8-sig') as fIn:
        reader = csv.DictReader(fIn)
        for row in reader:
            # print(row)
            corpus_sentences.add(row['content'])
            if len(corpus_sentences) >= max_corpus_size:
                break

    corpus_sentences = list(corpus_sentences)
    print("Encode the corpus. This might take a while")
    corpus_embeddings = model.encode(corpus_sentences, batch_size=64, show_progress_bar=True, convert_to_tensor=True)

    print("Store file on disc")
    with open(embedding_cache_path, "wb") as fOut:
        pickle.dump({'sentences': corpus_sentences, 'embeddings': corpus_embeddings}, fOut)
else:
    print("Load pre-computed embeddings from disc")
    with open(embedding_cache_path, "rb") as fIn:
        cache_data = pickle.load(fIn)
        corpus_sentences = cache_data['sentences']
        corpus_embeddings = cache_data['embeddings']
print("Start clustering")
start_time = time.time()
# community_detection是一个实现快速社群检测的函数。给定一个嵌入向量(embeddings),该函数会找出其中所有的社群,即距离(余弦相似度)小于阈值(threshold)的嵌入向量。函数返回的社群中的嵌入向量数量必须大于等于最小社群大小(min_community_size)。返回的社群列表按照社群大小降序排列,每个列表的第一个元素是社群中的中心点。
# Two parameters to tune:
#min_cluster_size: Only consider cluster that have at least 25 elements
#threshold: Consider sentence pairs with a cosine-similarity larger than threshold as similar
clusters = util.community_detection(corpus_embeddings, min_community_size=25, threshold=0.75)
print("Clustering done after {:.2f} sec".format(time.time() - start_time))
# Print the top 3 and bottom 3 elements of each cluster
# for i, cluster in enumerate(clusters):
#     print("\nCluster {}, #{} Elements ".format(i + 1, len(cluster)))
#     for sentence_id in cluster[0:3]:
#         print("\t", corpus_sentences[sentence_id])
#     print("\t", "...")
#     for sentence_id in cluster[-3:]:
#         print("\t", corpus_sentences[sentence_id])

# Write each cluster's sentences to a CSV file, labelled with the cluster index and size
def save_clusters_to_csv(clusters, corpus_sentences, filename):
    with open(filename, 'w', newline='') as file:
        writer = csv.writer(file)
        writer.writerow(["cluster", "content"])
        for i, cluster in enumerate(clusters):
            for sentence_id in cluster:
                writer.writerow(["cluster-{}-{}".format(i + 1, len(cluster)), corpus_sentences[sentence_id]])

cluster_path = "userFeedBack-202307-cluster.csv"
save_clusters_to_csv(clusters, corpus_sentences, cluster_path)
```
Of course, SentenceTransformer also supports fine-tuning: you can fine-tune an official model on a custom dataset. See the official documentation and examples for the details; I will cover it in a later post, so feel free to follow me via the button in the bottom-right corner.
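For a rough idea of what that fine-tuning looks like, here is a minimal sketch using the classic fit-based training API of sentence-transformers; the sentence pairs and similarity labels below are invented purely for illustration, and a real dataset would be built from labelled pairs of your own feedback.
```python
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")

# Invented similarity labels (0-1) purely for illustration
train_examples = [
    InputExample(texts=["饭前还是饭后", "睡觉前吃"], label=0.8),
    InputExample(texts=["一次吃几粒", "睡觉前吃"], label=0.2),
]

train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=2)
train_loss = losses.CosineSimilarityLoss(model)

# One short epoch just to show the call; tune epochs and warmup_steps on real data
model.fit(train_objectives=[(train_dataloader, train_loss)], epochs=1, warmup_steps=10)
model.save("userFeedBack-finetuned-model")
```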