Author: Fram Souza, Elastic

This blog introduces a GitHub Assistant that uses RAG and Elasticsearch for semantic code querying. It provides insight into GitHub repositories and can be extended to PR feedback, issue handling, and production readiness reviews.

The project lets you interact directly with a GitHub repository and use semantic search to understand its codebase. You will learn how to ask specific questions about the repository's code and receive meaningful, context-aware responses. You can follow along with the code on GitHub.

Key considerations
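
- Data quality: the output is only as good as the input, so make sure the data is clean and well structured.
- Chunk size: chunking the data appropriately is essential for good retrieval performance.
- Performance evaluation: evaluate the performance of your RAG-based application regularly.

Components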
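
- Elasticsearch: serves as the vector database, storing and retrieving embeddings efficiently.
- LlamaIndex: a framework for building LLM-powered applications.
- OpenAI: used for the LLM and for generating embeddings.

Architecture

Data ingestion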
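
The process starts by cloning the GitHub repository into a local /tmp directory. SimpleDirectoryReader then loads the cloned repository for indexing, and the documents are split into chunks according to file type: CodeSplitter handles code files, JSONNodeParser handles JSON, and SentenceSplitter handles Markdown, as the following function shows:
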
import glob
import os

from llama_index.core import SimpleDirectoryReader
from llama_index.core.node_parser import CodeSplitter, JSONNodeParser, SentenceSplitter
from tree_sitter_languages import get_parser

# clone_repository, print_docs_and_nodes, and collect_and_print_file_summary
# are helper functions defined elsewhere in the project.


def parse_documents():
    owner = os.getenv("GITHUB_OWNER")
    repo = os.getenv("GITHUB_REPO")
    branch = os.getenv("GITHUB_BRANCH")
    base_path = os.getenv("BASE_PATH", "/tmp")

    if not owner or not repo:
        raise ValueError("GITHUB_OWNER and GITHUB_REPO environment variables must be set.")

    local_repo_path = clone_repository(owner, repo, branch, base_path)

    nodes = []
    file_summary = []

    # One tree-sitter parser per supported language.
    ts_parser = get_parser("typescript")
    py_parser = get_parser("python")
    go_parser = get_parser("go")
    js_parser = get_parser("javascript")
    bash_parser = get_parser("bash")
    yaml_parser = get_parser("yaml")

    # Each entry pairs a splitter with the file extensions it handles.
    parsers_and_extensions = [
        (SentenceSplitter(), [".md"]),
        (CodeSplitter(language="python", parser=py_parser), [".py", ".ipynb"]),
        (CodeSplitter(language="typescript", parser=ts_parser), [".ts"]),
        (CodeSplitter(language="go", parser=go_parser), [".go"]),
        (CodeSplitter(language="javascript", parser=js_parser), [".js"]),
        (CodeSplitter(language="bash", parser=bash_parser), [".bash", ".sh"]),
        (CodeSplitter(language="yaml", parser=yaml_parser), [".yaml", ".yml"]),
        (JSONNodeParser(), [".json"]),
    ]

    for parser, extensions in parsers_and_extensions:
        matching_files = []
        for ext in extensions:
            matching_files.extend(glob.glob(f"{local_repo_path}/**/*{ext}", recursive=True))

        if len(matching_files) > 0:
            file_summary.append(f"Found {len(matching_files)} {', '.join(extensions)} files in the repository.")
            loader = SimpleDirectoryReader(input_dir=local_repo_path, required_exts=extensions, recursive=True)
            docs = loader.load_data()
            parsed_nodes = parser.get_nodes_from_documents(docs)
            print_docs_and_nodes(docs, parsed_nodes)
            nodes.extend(parsed_nodes)
        else:
            file_summary.append(f"No {', '.join(extensions)} files found in the repository.")

    collect_and_print_file_summary(file_summary)
    print("\n")
    return nodes

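The routing idea behind parsers_and_extensions can be tried out without LlamaIndex or a cloned repository. The following is a minimal sketch, not the project's code: the splitter labels are plain strings standing in for the real parser objects, and the names SPLITTERS_AND_EXTENSIONS, group_files_by_splitter, and summarize are hypothetical.

```python
from pathlib import Path

# Hypothetical stand-in for parsers_and_extensions: each label replaces a
# LlamaIndex splitter object so the sketch runs with the standard library only.
SPLITTERS_AND_EXTENSIONS = [
    ("sentence", [".md"]),
    ("python", [".py", ".ipynb"]),
    ("go", [".go"]),
    ("json", [".json"]),
]


def group_files_by_splitter(paths, splitters_and_extensions):
    """Return {label: [paths]} for files whose suffix belongs to a group."""
    grouped = {label: [] for label, _ in splitters_and_extensions}
    for path in paths:
        suffix = Path(path).suffix
        for label, extensions in splitters_and_extensions:
            if suffix in extensions:
                grouped[label].append(path)
                break  # each file is routed to at most one splitter
    return grouped


def summarize(grouped, splitters_and_extensions):
    """Build per-group summary lines in the same wording as parse_documents."""
    lines = []
    for label, extensions in splitters_and_extensions:
        count = len(grouped[label])
        names = ", ".join(extensions)
        if count > 0:
            lines.append(f"Found {count} {names} files in the repository.")
        else:
            lines.append(f"No {names} files found in the repository.")
    return lines


files = ["README.md", "cmd/main.go", "scripts/build.py", "notes.txt"]
grouped = group_files_by_splitter(files, SPLITTERS_AND_EXTENSIONS)
summary_lines = summarize(grouped, SPLITTERS_AND_EXTENSIONS)
for line in summary_lines:
    print(line)
```

Files with an extension that no group claims (notes.txt here) are simply left out, which matches the behavior of a reader restricted to a list of required extensions.
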
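To support more languages in this code, just add a new parser and its extensions to the parsers_and_extensions list. After the nodes are parsed, embeddings are generated with the text-embedding-3-large model and stored in Elasticsearch. The embedding model is declared through the Settings object, which makes it a global setting:

Settings.embed_model = OpenAIEmbedding(model="text-embedding-3-large")
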
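It is then used in the main function as part of the ingestion pipeline. Because it is a global setting, there is no need to pass it again during ingestion:

nodes = parse_documents()
es_vector_store = get_es_vector_store()

try:
    pipeline = IngestionPipeline(
        vector_store=es_vector_store,
    )
    pipeline.run(documents=nodes, show_progress=True)
# (excerpt: the rest of the try statement is not shown)
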
The code block above first parses the documents into smaller chunks (nodes) and then initializes the connection to Elasticsearch. An IngestionPipeline is created with the specified Elasticsearch vector store, and the pipeline is run to process the nodes and store their embeddings in Elasticsearch, showing progress along the way. At this point, your data should be indexed in Elasticsearch, with the embeddings generated and stored. Here is an example of what a document looks like in ESS:

{
  "_source": {
    "content": "Changelog\n\nAll notable changes to this project will be documented in this file.\n\n**For detailed release notes, please refer to the [GitHub\nreleases](https://github.com/elastic/synthetics/releases) page.**",
    "metadata": {
      "file_path": "/tmp/elastic/synthetics/CHANGELOG.md",
      "file_name": "CHANGELOG.md",
      "file_size": 23162,
      "creation_date": "2024-10-08",
      "last_modified_date": "2024-10-08",
      "_node_content": {
        "id_": "2918efbb-b1aa-4afa-a505-d584e62d0d87",
        "embedding": null,
        "metadata": {
          "file_path": "/tmp/elastic/synthetics/CHANGELOG.md",
          "file_name": "CHANGELOG.md",
          "file_size": 23162,
          "creation_date": "2024-10-08",
          "last_modified_date": "2024-10-08"
        },
        "excluded_embed_metadata_keys": ["file_name", "file_type", "file_size", "creation_date", "last_modified_date", "last_accessed_date"],
        "excluded_llm_metadata_keys": ["file_name", "file_type", "file_size", "creation_date", "last_modified_date", "last_accessed_date"],
        "relationships": {
          "1": {
            "node_id": "b0574471-c909-4fc8-ab82-2165c45ba72a",
            "node_type": "4",
            "metadata": {
              "file_path": "/tmp/elastic/synthetics/CHANGELOG.md",
              "file_name": "CHANGELOG.md",
              "file_size": 23162,
              "creation_date": "2024-10-08",
              "last_modified_date": "2024-10-08"
            },
            "hash": "58b8f33fdb38603530f1d06333a6d84614d21bb305a2aee4cb74f174fd5037aa",
            "class_name": "RelatedNodeInfo"
          }
        },
        "text": "",
        "mimetype": "text/plain",
        "start_char_idx": 0,
        "end_char_idx": 204,
        "text_template": "{metadata_str}\n\n{content}",
        "metadata_template": "{key}: {value}",
        "metadata_seperator": "\n",
        "class_name": "TextNode"
      },
      "_node_type": "TextNode",
      "document_id": "b0574471-c909-4fc8-ab82-2165c45ba72a",
      "doc_id": "b0574471-c909-4fc8-ab82-2165c45ba72a",
      "ref_doc_id": "b0574471-c909-4fc8-ab82-2165c45ba72a"
    },
    "embeddings": []
  }
}

Query
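
Once the data is indexed, you can query the Elasticsearch index to ask questions about the codebase. The query.py script lets you interact with the indexed data. It reads a query from the user, creates an embedding with the same OpenAIEmbedding model used in index.py, and sets up a query engine from a VectorStoreIndex loaded from the Elasticsearch vector store. The query engine runs a similarity search and retrieves the top 3 most relevant documents, ranked by how similar the stored embeddings are to the query. Setting response_mode="tree_summarize" summarizes the results in a tree-like fashion, as the following snippet shows:

# Excerpt from a function in query.py; the rest of the try statement is not shown.
query = input("Please enter your query: ")

openai_llm = OpenAI(model="gpt-4o")
es_vector_store = get_es_vector_store()
index = VectorStoreIndex.from_vector_store(es_vector_store)

try:
    query_engine = index.as_query_engine(
        llm=openai_llm,
        similarity_top_k=3,
        streaming=False,
        response_mode="tree_summarize",
    )
    bundle = QueryBundle(query, embedding=embed_model.get_query_embedding(query))
    result = query_engine.query(bundle)
    return result.response

Installation

1. Clone the repository: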
git clone https://github.com/framsouza/github-assistant.git
cd github-assistant

2. Install the required libraries:
pip install -r requirements.txt

3. Set up the environment variables:
Update the .env file with your Elasticsearch credentials and the details of the target GitHub repository (for example GITHUB_TOKEN, GITHUB_OWNER, GITHUB_REPO, GITHUB_BRANCH, ELASTIC_CLOUD_ID, ELASTIC_USER, ELASTIC_PASSWORD, ELASTIC_INDEX).
Here is an example of the .env file:
GITHUB_TOKEN
GITHUB_OWNER
GITHUB_REPO
GITHUB_BRANCH
ELASTIC_CLOUD_ID
ELASTIC_USER
ELASTIC_PASSWORD
ELASTIC_INDEX
OPENAI_API_KEY

Usage

1. Index your data and create the embeddings by running:
python index.py

2. Ask questions about your codebase by running:
python query.py

Example
python query.py
Please enter your query: Give me a detailed list of the external dependencies being used in this repository

Based on the provided context, the following is a list of third-party dependencies used in the given Elastic Cloud on K8s project:
1. dario.cat/mergo (BSD-3-Clause, v1.0.0)
2. Masterminds/sprig (MIT, v3.2.3)
3. Masterminds/semver (MIT, v4.0.0)
4. go-spew (ISC, v1.1.2-0.20180830191138-d8f796af33cc)
5. elastic/go-ucfg (Apache-2.0, v0.8.8)
6. ghodss/yaml (MIT, v1.0.0)
7. go-logr/logr (Apache-2.0, v1.4.1)
8. go-test/deep (MIT, v1.1.0)
9. gobuffalo/flect (MIT, v1.0.2)
10. google/go-cmp (BSD-3-Clause, v0.6.0)
...
This list includes both direct and indirect dependencies as identified in the context.
None

Questions you might want to ask

- Give me a detailed description of what are the main functionalities implemented in the code?
- How does the code handle errors and exceptions?
- Could you evaluate the test coverage of this codebase and also provide detailed insights into potential enhancements to improve test coverage significantly?

Evaluation
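
The evaluation.py script processes documents, generates evaluation questions based on their content, and then uses an LLM to evaluate the responses for relevancy (is the response relevant to the question?) and faithfulness (is the response faithful to the source content?). Here is a step-by-step guide on how to use the code: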
python evaluation.py --num_documents 5 --skip_documents 2 --num_questions 3 --skip_questions 1 --process_last_questions
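You can run the code without any arguments, but the example above demonstrates how to use them. Here is a breakdown of what each argument does:

Document processing

- --num_documents 5: the script processes a total of 5 documents.
- --skip_documents 2: the first 2 documents are skipped and processing starts at the 3rd document, so documents 3, 4, 5, 6, and 7 are processed.

Question generation

After loading the documents, the script generates a list of questions based on their content.

- --num_questions 3: only 3 of the generated questions are processed.
- --skip_questions 1: the script skips the first question in the list and starts from the second.
- --process_last_questions: instead of processing the first 3 questions after skipping the first one, the script processes the last 3 questions in the list.
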
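The skip-then-take behavior described above can be sketched with argparse and list slicing. This is an illustration under assumptions, not the code of evaluation.py: the defaults, the helper names build_parser and select_window, and the choice to apply the skip before taking the last items are all hypothetical.

```python
import argparse


def build_parser():
    """Declare the flags from the example command (defaults are assumptions)."""
    parser = argparse.ArgumentParser(description="Sketch of the evaluation flags")
    parser.add_argument("--num_documents", type=int, default=None)
    parser.add_argument("--skip_documents", type=int, default=0)
    parser.add_argument("--num_questions", type=int, default=None)
    parser.add_argument("--skip_questions", type=int, default=0)
    parser.add_argument("--process_last_questions", action="store_true")
    return parser


def select_window(items, skip, count, take_last=False):
    """Drop the first `skip` items, then keep `count` items from the start,
    or from the end when take_last is True. count=None keeps everything."""
    remaining = items[skip:]
    if count is None:
        return remaining
    if take_last:
        # Guard count == 0, because remaining[-0:] would return the whole list.
        return remaining[-count:] if count > 0 else []
    return remaining[:count]


args = build_parser().parse_args(
    ["--num_documents", "5", "--skip_documents", "2",
     "--num_questions", "3", "--skip_questions", "1"]
)

documents = [f"doc{i}" for i in range(1, 11)]  # stand-ins for loaded documents
questions = [f"q{i}" for i in range(21)]       # stand-ins for generated questions

picked_docs = select_window(documents, args.skip_documents, args.num_documents)
first_questions = select_window(questions, args.skip_questions, args.num_questions)
last_questions = select_window(
    questions, args.skip_questions, args.num_questions, take_last=True
)

print(picked_docs)      # ['doc3', 'doc4', 'doc5', 'doc6', 'doc7']
print(first_questions)  # ['q1', 'q2', 'q3']
print(last_questions)   # ['q18', 'q19', 'q20']
```

With 2 documents skipped and 5 requested, the window covers documents 3 through 7, which agrees with the description of --skip_documents above.
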
Number of documents loaded: 5

All available questions generated:
0. What is the purpose of chunking monitors in the updated push command as mentioned in the changelog?
1. How does the changelog describe the improvement made to the performance of the push command?
2. What new feature is added to the synthetics project when it is created via the init command?
3. According to the changelog, what is the file size of the CHANGELOG.md document?
4. On what date was the CHANGELOG.md file last modified?
5. What is the significance of the example lightweight monitor yaml file mentioned in the changelog?
6. How might the changes described in the changelog impact the workflow of users creating or updating monitors?
7. What is the file path where the CHANGELOG.md document is located?
8. Can you identify the issue numbers associated with the changes mentioned in the changelog?
9. What is the creation date of the CHANGELOG.md file as per the context information?
10. What type of file is the document described in the context information?
11. On what date was the CHANGELOG.md file last modified?
12. What is the file size of the CHANGELOG.md document?
13. Identify one of the bug fixes mentioned in the CHANGELOG.md file.
14. What command is referenced in the context of creating new synthetics projects?
15. How does the CHANGELOG.md file address the issue of varying NDJSON chunked response sizes?
16. What is the significance of the number #680 in the context of the document?
17. What problem is addressed by skipping the addition of empty values for locations?
18. How many bug fixes are explicitly mentioned in the provided context?
19. What is the file path of the CHANGELOG.md document?
20. What is the file path of the document being referenced in the context information?
...

Generated questions:
1. What command is referenced in relation to the bug fix in the CHANGELOG.md?
2. On what date was the CHANGELOG.md file created?
3. What is the primary purpose of the document based on the context provided?

Total number of questions generated: 3

Processing Question 1 of 3:

Evaluation Result:
--------------------------------------------------------------------------------
| Query | Response | Source | Relevancy Response | Relevancy Feedback | Relevancy Score | Faith Response | Faith Feedback |
| What command is referenced in relation to the bug fix in the CHANGELOG.md? | The init command is referenced in relation to the bug fix in the CHANGELOG.md. | Bug Fixes - Pick the correct loader when bundling TypeScript or JavaScript journey files during push command #626 | Pass | YES | 1 | Pass | YES |
--------------------------------------------------------------------------------

Processing Question 2 of 3:

Evaluation Result:
--------------------------------------------------------------------------------
| Query | Response | Source | Relevancy Response | Relevancy Feedback | Relevancy Score | Faith Response | Faith Feedback |
| On what date was the CHANGELOG.md file created? | The date mentioned in the CHANGELOG.md file is November 2, 2022. | v1.0.0-beta-38 (20222-11-02) | Pass | YES | 1 | Pass | YES |
--------------------------------------------------------------------------------

Processing Question 3 of 3:

Evaluation Result:
--------------------------------------------------------------------------------
| Query | Response | Source | Relevancy Response | Relevancy Feedback | Relevancy Score | Faith Response | Faith Feedback |
| What is the primary purpose of the document based on the context provided? | The primary purpose of the document is to provide a changelog detailing the features and improvements made in version 1.0.0-beta-38 of a software project. It highlights specific enhancements such as improved validation for monitor schedules and an enhanced push command experience. | v1.0.0-beta-38 (20222-11-02) | Pass | YES | 1 | Pass | YES |
--------------------------------------------------------------------------------
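
When many questions are evaluated, per-question tables like the ones above become hard to scan. One possible way to condense them is to tally pass rates, as in the following sketch. The EvaluationRow class and summarize_results function are hypothetical and not part of evaluation.py; the rows are copied from the three results shown above, and any response other than "Pass" is assumed to count as not passed.

```python
from dataclasses import dataclass


@dataclass
class EvaluationRow:
    """A few columns of one result table (field names are illustrative)."""
    query: str
    relevancy_response: str
    relevancy_score: float
    faith_response: str


def summarize_results(rows):
    """Return pass rates and the mean relevancy score for a list of rows."""
    total = len(rows)
    if total == 0:
        return {
            "total": 0,
            "relevancy_pass_rate": None,
            "faithfulness_pass_rate": None,
            "mean_relevancy_score": None,
        }
    relevancy_passes = sum(1 for row in rows if row.relevancy_response == "Pass")
    faith_passes = sum(1 for row in rows if row.faith_response == "Pass")
    return {
        "total": total,
        "relevancy_pass_rate": relevancy_passes / total,
        "faithfulness_pass_rate": faith_passes / total,
        "mean_relevancy_score": sum(row.relevancy_score for row in rows) / total,
    }


rows = [
    EvaluationRow(
        "What command is referenced in relation to the bug fix in the CHANGELOG.md?",
        "Pass", 1, "Pass",
    ),
    EvaluationRow("On what date was the CHANGELOG.md file created?", "Pass", 1, "Pass"),
    EvaluationRow(
        "What is the primary purpose of the document based on the context provided?",
        "Pass", 1, "Pass",
    ),
]

summary = summarize_results(rows)
print(summary)
```
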

What's next?
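
Here are a few ways you can use this code:

- Gain insight into a specific GitHub repository by asking questions about its code, such as locating functions or understanding how parts of the code work.
- Build a multi-agent RAG system that ingests GitHub PRs and issues, enabling automatic responses to issues and feedback on PRs.
- Combine your logs and metrics with the GitHub code in Elasticsearch and use RAG to create a production readiness review that helps assess the maturity of your services.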

Happy RAG!

Original article: Ask questions about your GitHub repository with Elasticsearch as a vector database - Search Labs