Earlier in this column we introduced RAG, a tool for putting LLMs to work. Its implementation has one major drawback: retrieval returns only individual chunks, so it can answer questions about specific parts of the documents but not questions that require the knowledge base as a whole.
Some time ago Microsoft open-sourced GraphRAG (Graph-based Retrieval Augmented Generation), which adds a knowledge graph on top of ordinary RAG. The main GraphRAG process has three steps: use an LLM to extract entities and the relationships between them from the knowledge base; cluster those entities and relationships into communities and use an LLM to write a summary for each community; use an LLM together with the community summaries to answer user questions. As a result, GraphRAG-based question answering has richer context and relationship information to draw on, which improves both retrieval and generation. The whole pipeline is shown in the figure below.
paper: https://arxiv.org/pdf/2404.16130
code: https://github.com/microsoft/graphrag
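The GraphRAG process described above can be sketched in a few lines of Python. This is a toy illustration and not GraphRAG's implementation: the relationship triples are hard-coded instead of being extracted by an LLM, connected components stand in for the Leiden community detection described in the paper, and `summarize` simply joins facts where GraphRAG would call an LLM. All function names are made up for this sketch.

```python
from collections import defaultdict

# Stand-in for the LLM extraction step: (source, target, description) triples.
# These are hard-coded assumptions; GraphRAG obtains them from LLM calls.
RELATIONSHIPS = [
    ("Jia Baoyu", "Jia Yuanchun", "siblings"),
    ("Jia Baoyu", "Lin Daiyu", "cousins"),
    ("Xue Pan", "Xue Baochai", "siblings"),
]


def build_graph(relationships):
    """Build an undirected adjacency map from relationship triples."""
    graph = defaultdict(set)
    for source, target, _ in relationships:
        graph[source].add(target)
        graph[target].add(source)
    return graph


def find_communities(graph):
    """Group entities by connected component (a simplification of community detection)."""
    seen, communities = set(), []
    for start in sorted(graph):
        if start in seen:
            continue
        stack, members = [start], set()
        while stack:
            node = stack.pop()
            if node in members:
                continue
            members.add(node)
            stack.extend(graph[node] - members)
        seen |= members
        communities.append(sorted(members))
    return communities


def summarize(community, relationships):
    """Placeholder for the LLM summary step: list the relationships inside one community."""
    members = set(community)
    facts = [
        f"{source} and {target} are {description}"
        for source, target, description in relationships
        if source in members and target in members
    ]
    return "; ".join(facts)


graph = build_graph(RELATIONSHIPS)
communities = find_communities(graph)
summaries = [summarize(community, RELATIONSHIPS) for community in communities]
for community, summary in zip(communities, summaries):
    print(community, "->", summary)
```

At query time, the global mode answers from summaries like these rather than from raw chunks, which is why it can handle questions that span the whole knowledge base.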
This article walks you through using GraphRAG step by step~
Setting up the GraphRAG environment is simple: just install the graphrag package.
conda create -n graphrag python=3.10
conda activate graphrag
git clone https://github.com/microsoft/graphrag.git
pip install graphrag
The best example of a knowledge graph that comes to mind is the character relationship graph of Dream of Red Mansions. So you can have a large model generate profiles of the four great families and the key characters of the novel, then save the result as a txt file. Of course, you can also use the demo data or your own business data directly (note that the whole GraphRAG process consumes a lot of tokens).
We create the directory hlmtest/input and put the txt file in it.
mkdir -p ./hlmtest/input
Run the following command in the GraphRAG root directory.
python -m graphrag.index --init --root ./hlmtest
You will get default .env and settings.yaml files, with the following structure:
Set your openai_key in the .env file.
GRAPHRAG_API_KEY=your_openai_key
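To make the link between .env and settings.yaml concrete, here is a minimal sketch of how a `KEY=VALUE` line can end up in a `${GRAPHRAG_API_KEY}` placeholder. It is an illustration only and not GraphRAG's own loader; `load_env_text` and `resolve_placeholder` are hypothetical helper names.

```python
import string


def load_env_text(text: str) -> dict:
    """Parse KEY=VALUE lines, skipping blank lines and # comments."""
    values = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def resolve_placeholder(template: str, env: dict) -> str:
    """Expand ${NAME} placeholders; unknown names are left untouched."""
    return string.Template(template).safe_substitute(env)


env = load_env_text("# local secrets\nGRAPHRAG_API_KEY=your_openai_key\n")
resolved = resolve_placeholder("api_key: ${GRAPHRAG_API_KEY}", env)
print(resolved)
```

Keeping the key in .env rather than in settings.yaml means the settings file can be shared or committed without exposing the secret.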
The settings.yaml file mainly configures the model, chunking, prompts, and other project parameters.
For example, the LLM module for the Azure account I use is configured as follows:
llm:
  api_key: ${GRAPHRAG_API_KEY}
  type: azure_openai_chat # or openai_chat
  model: model_name
  model_supports_json: true # recommended if this is available for your model.
  # max_tokens: 4000
  # request_timeout: 180.0
  api_base: your_api_base
  api_version: your_api_version
  deployment_name: your_deployment_name
Next come the document format and chunking rules, where I use the defaults. The parameters show that the input module has a lot of room for improvement in file-type support and chunking.
chunks:
  size: 300
  overlap: 100
  group_by_columns: [id] # by default, we don't allow chunks to cross documents
input:
  type: file # or blob
  file_type: text # or csv
  base_dir: "input"
  file_encoding: utf-8
  file_pattern: ".*\\.txt$"
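The effect of `size`, `overlap`, and `file_pattern` can be illustrated with a small sketch. It assumes a simple sliding window over a list of tokens; GraphRAG measures chunks with a real tokenizer, so the list items here only stand in for tokens, and `chunk_tokens` and `select_input_files` are hypothetical names.

```python
import re


def chunk_tokens(tokens, size=300, overlap=100):
    """Split a token list into windows of `size` that share `overlap` tokens."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be non-negative and smaller than size")
    step = size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # the last window already reaches the end of the document
    return chunks


# Same regular expression as file_pattern in settings.yaml (after YAML unescaping).
FILE_PATTERN = re.compile(r".*\.txt$")


def select_input_files(names):
    """Keep only the file names that the input module would pick up."""
    return [name for name in names if FILE_PATTERN.match(name)]


tokens = [f"t{i}" for i in range(700)]
chunks = chunk_tokens(tokens)
print([(chunk[0], chunk[-1], len(chunk)) for chunk in chunks])
print(select_input_files(["hlm.txt", "notes.csv", "hlm.txt.bak"]))
```

With the defaults, consecutive chunks share a third of their tokens, so the same text is sent to the LLM more than once; this is one reason indexing consumes so many tokens.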
Looking at modules such as entity extraction, we can see that several prompts are generated automatically.
entity_extraction:
  ## llm: override the global llm settings for this task
  ## parallelization: override the global parallelization settings for this task
  ## async_mode: override the global async_mode settings for this task
  prompt: "prompts/entity_extraction.txt"
  # Note: you can define your own entity types and adjust the examples in the prompt to match your actual data.
  entity_types: [organization,person,geo,event]
  max_gleanings: 0
summarize_descriptions:
  ## llm: override the global llm settings for this task
  ## parallelization: override the global parallelization settings for this task
  ## async_mode: override the global async_mode settings for this task
  prompt: "prompts/summarize_descriptions.txt"
  max_length: 500
claim_extraction:
  ## llm: override the global llm settings for this task
  ## parallelization: override the global parallelization settings for this task
  ## async_mode: override the global async_mode settings for this task
  # enabled: true
  prompt: "prompts/claim_extraction.txt"
  description: "Any claims or facts that could be relevant to information discovery."
  max_gleanings: 0
community_reports:
  ## llm: override the global llm settings for this task
  ## parallelization: override the global parallelization settings for this task
  ## async_mode: override the global async_mode settings for this task
  prompt: "prompts/community_report.txt"
  max_length: 2000
  max_input_length: 8000
You can focus on a few of the prompts to get a rough idea of the entities, relationships, and report output structure that the large model produces.
Let's look at the entity extraction prompt:
-Goal-
Given a text document that is potentially relevant to this activity and a list of entity types, identify all entities of those types from the text and all relationships among the identified entities.
-Steps-
1. Identify all entities. For each identified entity, extract the following information:
- entity_name: Name of the entity, capitalized
- entity_type: One of the following types: [{entity_types}]
- entity_description: Comprehensive description of the entity's attributes and activities
Format each entity as ("entity"{tuple_delimiter}<entity_name>{tuple_delimiter}<entity_type>{tuple_delimiter}<entity_description>)
2. From the entities identified in step 1, identify all pairs of (source_entity, target_entity) that are *clearly related* to each other.
For each pair of related entities, extract the following information:
- source_entity: name of the source entity, as identified in step 1
- target_entity: name of the target entity, as identified in step 1
- relationship_description: explanation as to why you think the source entity and the target entity are related to each other
- relationship_strength: a numeric score indicating strength of the relationship between the source entity and target entity
Format each relationship as ("relationship"{tuple_delimiter}<source_entity>{tuple_delimiter}<target_entity>{tuple_delimiter}<relationship_description>{tuple_delimiter}<relationship_strength>)
3. Return output in English as a single list of all the entities and relationships identified in steps 1 and 2. Use **{record_delimiter}** as the list delimiter.
4. When finished, output {completion_delimiter}
######################
-Examples-
######################
Example 1:
Entity_types: [person, technology, mission, organization, location]
Text:
while Alex clenched his jaw, the buzz of frustration dull against the backdrop of Taylor's authoritarian certainty. It was this competitive undercurrent that kept him alert, the sense that his and Jordan's shared commitment to discovery was an unspoken rebellion against Cruz's narrowing vision of control and order.
Then Taylor did something unexpected. They paused beside Jordan and, for a moment, observed the device with something akin to reverence. "If this tech can be understood..." Taylor said, their voice quieter, "It could change the game for us. For all of us."
The underlying dismissal earlier seemed to falter, replaced by a glimpse of reluctant respect for the gravity of what lay in their hands. Jordan looked up, and for a fleeting heartbeat, their eyes locked with Taylor's, a wordless clash of wills softening into an uneasy truce.
It was a small transformation, barely perceptible, but one that Alex noted with an inward nod. They had all been brought here by different paths
################
Output:
("entity"{tuple_delimiter}"Alex"{tuple_delimiter}"person"{tuple_delimiter}"Alex is a character who experiences frustration and is observant of the dynamics among other characters."){record_delimiter}
("entity"{tuple_delimiter}"Taylor"{tuple_delimiter}"person"{tuple_delimiter}"Taylor is portrayed with authoritarian certainty and shows a moment of reverence towards a device, indicating a change in perspective."){record_delimiter}
("entity"{tuple_delimiter}"Jordan"{tuple_delimiter}"person"{tuple_delimiter}"Jordan shares a commitment to discovery and has a significant interaction with Taylor regarding a device."){record_delimiter}
("entity"{tuple_delimiter}"Cruz"{tuple_delimiter}"person"{tuple_delimiter}"Cruz is associated with a vision of control and order, influencing the dynamics among other characters."){record_delimiter}
("entity"{tuple_delimiter}"The Device"{tuple_delimiter}"technology"{tuple_delimiter}"The Device is central to the story, with potential game-changing implications, and is revered by Taylor."){record_delimiter}
("relationship"{tuple_delimiter}"Alex"{tuple_delimiter}"Taylor"{tuple_delimiter}"Alex is affected by Taylor's authoritarian certainty and observes changes in Taylor's attitude towards the device."{tuple_delimiter}7){record_delimiter}
("relationship"{tuple_delimiter}"Alex"{tuple_delimiter}"Jordan"{tuple_delimiter}"Alex and Jordan share a commitment to discovery, which contrasts with Cruz's vision."{tuple_delimiter}6){record_delimiter}
("relationship"{tuple_delimiter}"Taylor"{tuple_delimiter}"Jordan"{tuple_delimiter}"Taylor and Jordan interact directly regarding the device, leading to a moment of mutual respect and an uneasy truce."{tuple_delimiter}8){record_delimiter}
("relationship"{tuple_delimiter}"Jordan"{tuple_delimiter}"Cruz"{tuple_delimiter}"Jordan's commitment to discovery is in rebellion against Cruz's vision of control and order."{tuple_delimiter}5){record_delimiter}
("relationship"{tuple_delimiter}"Taylor"{tuple_delimiter}"The Device"{tuple_delimiter}"Taylor shows reverence towards the device, indicating its importance and potential impact."{tuple_delimiter}9){completion_delimiter}
#############################
Example 2:
Entity_types: [person, technology, mission, organization, location]
Text:
They were no longer mere operatives; they had become guardians of a threshold, keepers of a message from a realm beyond stars and stripes. This elevation in their mission could not be shackled by regulations and established protocols--it demanded a new perspective, a new resolve.
Tension threaded through the dialogue of beeps and static as communications with Washington buzzed in the background. The team stood, a portentous air enveloping them. It was clear that the decisions they made in the ensuing hours could redefine humanity's place in the cosmos or condemn them to ignorance and potential peril.
Their connection to the stars solidified, the group moved to address the crystallizing warning, shifting from passive recipients to active participants. Mercer's latter instincts gained precedence-- the team's mandate had evolved, no longer solely to observe and report but to interact and prepare. A metamorphosis had begun, and Operation: Dulce hummed with the newfound frequency of their daring, a tone set not by the earthly
#############
Output:
("entity"{tuple_delimiter}"Washington"{tuple_delimiter}"location"{tuple_delimiter}"Washington is a location where communications are being received, indicating its importance in the decision-making process."){record_delimiter}
("entity"{tuple_delimiter}"Operation: Dulce"{tuple_delimiter}"mission"{tuple_delimiter}"Operation: Dulce is described as a mission that has evolved to interact and prepare, indicating a significant shift in objectives and activities."){record_delimiter}
("entity"{tuple_delimiter}"The team"{tuple_delimiter}"organization"{tuple_delimiter}"The team is portrayed as a group of individuals who have transitioned from passive observers to active participants in a mission, showing a dynamic change in their role."){record_delimiter}
("relationship"{tuple_delimiter}"The team"{tuple_delimiter}"Washington"{tuple_delimiter}"The team receives communications from Washington, which influences their decision-making process."{tuple_delimiter}7){record_delimiter}
("relationship"{tuple_delimiter}"The team"{tuple_delimiter}"Operation: Dulce"{tuple_delimiter}"The team is directly involved in Operation: Dulce, executing its evolved objectives and activities."{tuple_delimiter}9){completion_delimiter}
#############################
Example 3:
Entity_types: [person, role, technology, organization, event, location, concept]
Text:
their voice slicing through the buzz of activity. "Control may be an illusion when facing an intelligence that literally writes its own rules," they stated stoically, casting a watchful eye over the flurry of data.
"It's like it's learning to communicate," offered Sam Rivera from a nearby interface, their youthful energy boding a mix of awe and anxiety. "This gives talking to strangers' a whole new meaning."
Alex surveyed his team--each face a study in concentration, determination, and not a small measure of trepidation. "This might well be our first contact," he acknowledged, "And we need to be ready for whatever answers back."
Together, they stood on the edge of the unknown, forging humanity's response to a message from the heavens. The ensuing silence was palpable--a collective introspection about their role in this grand cosmic play, one that could rewrite human history.
The encrypted dialogue continued to unfold, its intricate patterns showing an almost uncanny anticipation
#############
Output:
("entity"{tuple_delimiter}"Sam Rivera"{tuple_delimiter}"person"{tuple_delimiter}"Sam Rivera is a member of a team working on communicating with an unknown intelligence, showing a mix of awe and anxiety."){record_delimiter}
("entity"{tuple_delimiter}"Alex"{tuple_delimiter}"person"{tuple_delimiter}"Alex is the leader of a team attempting first contact with an unknown intelligence, acknowledging the significance of their task."){record_delimiter}
("entity"{tuple_delimiter}"Control"{tuple_delimiter}"concept"{tuple_delimiter}"Control refers to the ability to manage or govern, which is challenged by an intelligence that writes its own rules."){record_delimiter}
("entity"{tuple_delimiter}"Intelligence"{tuple_delimiter}"concept"{tuple_delimiter}"Intelligence here refers to an unknown entity capable of writing its own rules and learning to communicate."){record_delimiter}
("entity"{tuple_delimiter}"First Contact"{tuple_delimiter}"event"{tuple_delimiter}"First Contact is the potential initial communication between humanity and an unknown intelligence."){record_delimiter}
("entity"{tuple_delimiter}"Humanity's Response"{tuple_delimiter}"event"{tuple_delimiter}"Humanity's Response is the collective action taken by Alex's team in response to a message from an unknown intelligence."){record_delimiter}
("relationship"{tuple_delimiter}"Sam Rivera"{tuple_delimiter}"Intelligence"{tuple_delimiter}"Sam Rivera is directly involved in the process of learning to communicate with the unknown intelligence."{tuple_delimiter}9){record_delimiter}
("relationship"{tuple_delimiter}"Alex"{tuple_delimiter}"First Contact"{tuple_delimiter}"Alex leads the team that might be making the First Contact with the unknown intelligence."{tuple_delimiter}10){record_delimiter}
("relationship"{tuple_delimiter}"Alex"{tuple_delimiter}"Humanity's Response"{tuple_delimiter}"Alex and his team are the key figures in Humanity's Response to the unknown intelligence."{tuple_delimiter}8){record_delimiter}
("relationship"{tuple_delimiter}"Control"{tuple_delimiter}"Intelligence"{tuple_delimiter}"The concept of Control is challenged by the Intelligence that writes its own rules."{tuple_delimiter}7){completion_delimiter}
#############################
-Real Data-
######################
Entity_types: {entity_types}
Text: {input_text}
######################
Output:
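The output format that this prompt asks for can be parsed back into records with a short sketch like the one below. The concrete delimiter strings are assumptions made for illustration; check which values your GraphRAG version substitutes for `{tuple_delimiter}`, `{record_delimiter}`, and `{completion_delimiter}`. `parse_extraction` is a hypothetical helper and not part of the graphrag package.

```python
# Assumed delimiter values; the prompt only names the placeholders.
TUPLE_DELIMITER = "<|>"
RECORD_DELIMITER = "##"
COMPLETION_DELIMITER = "<|COMPLETE|>"


def parse_extraction(output: str):
    """Split raw LLM output into entity and relationship dictionaries."""
    entities, relationships = [], []
    body = output.replace(COMPLETION_DELIMITER, "")
    for record in body.split(RECORD_DELIMITER):
        record = record.strip()
        if not (record.startswith("(") and record.endswith(")")):
            continue  # skip anything that is not a parenthesised tuple
        fields = [field.strip().strip('"') for field in record[1:-1].split(TUPLE_DELIMITER)]
        if fields[0] == "entity" and len(fields) == 4:
            entities.append(
                {"name": fields[1], "type": fields[2], "description": fields[3]}
            )
        elif fields[0] == "relationship" and len(fields) == 5:
            try:
                strength = float(fields[4])
            except ValueError:
                continue  # the model returned a non-numeric strength
            relationships.append(
                {
                    "source": fields[1],
                    "target": fields[2],
                    "description": fields[3],
                    "strength": strength,
                }
            )
    return entities, relationships


sample = (
    '("entity"<|>"Alex"<|>"person"<|>"Alex leads the team.")##\n'
    '("entity"<|>"First Contact"<|>"event"<|>"Potential initial communication.")##\n'
    '("relationship"<|>"Alex"<|>"First Contact"<|>"Alex leads the team that might '
    'make First Contact."<|>10)<|COMPLETE|>'
)
entities, relationships = parse_extraction(sample)
print(entities)
print(relationships)
```

Malformed records are skipped rather than raising, since LLM output does not always follow the requested format exactly.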
You should adapt the prompt to your own data, otherwise the results will be poor. If your dataset is in Chinese, try rewriting the prompt in Chinese.
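As a rough sketch of what adapting the prompt involves, the snippet below fills the `{entity_types}` and `{input_text}` placeholders of a shortened template with entity types suited to a novel. The template excerpt, the entity type list, and `render_prompt` are all assumptions for illustration; in a real project you would edit prompts/entity_extraction.txt and `entity_types` in settings.yaml instead.

```python
def render_prompt(template: str, values: dict) -> str:
    """Replace {name} placeholders one by one, leaving other braces untouched."""
    rendered = template
    for name, value in values.items():
        rendered = rendered.replace("{" + name + "}", value)
    return rendered


# Shortened excerpt of the "Real Data" part of the extraction prompt.
TEMPLATE = (
    "Entity_types: {entity_types}\n"
    "Text: {input_text}\n"
    "Output:"
)

# Hypothetical entity types for a character-relationship dataset.
ENTITY_TYPES = ["person", "family", "residence", "event"]

prompt = render_prompt(
    TEMPLATE,
    {
        "entity_types": ",".join(ENTITY_TYPES),
        "input_text": "Jia Yuanchun is the elder sister of Jia Baoyu.",
    },
)
print(prompt)
```

Plain string replacement is used instead of `str.format` so that literal braces elsewhere in a prompt do not cause errors.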
With .env and settings.yaml configured, we can run the indexing~
One more reminder: the whole process makes many LLM calls, so set your expectations for token consumption accordingly. Some users report that indexing a document of 185,000 characters with GPT-4o cost about 7 US dollars.
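That single data point can be turned into a very rough budget for your own corpus. The sketch below assumes cost scales linearly with character count, which is only an approximation: the real cost depends on the model, its pricing, the chunk settings, and the language of the text. `estimate_cost_usd` is a hypothetical helper.

```python
# Reference point reported above: about 7 USD for 185,000 characters with GPT-4o.
REFERENCE_CHARS = 185_000
REFERENCE_COST_USD = 7.0


def estimate_cost_usd(num_chars: int) -> float:
    """Linear extrapolation from the single reported data point (an assumption)."""
    if num_chars < 0:
        raise ValueError("num_chars must be non-negative")
    return REFERENCE_COST_USD * num_chars / REFERENCE_CHARS


for size in (50_000, 185_000, 1_000_000):
    print(f"{size:>9} characters -> roughly {estimate_cost_usd(size):.2f} USD")
```

Treat the result as an order-of-magnitude guide and try a small sample of your data first.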
python -m graphrag.index --root ./hlmtest
If the run succeeds, the following log appears, and you can see the steps and progress of the workflow:
Let's look at what is output after the run.
indexing-engine.log: this file records the detailed log of the pipeline.
Next, look at the generated files create_final_relationships and create_final_community_reports. Here I used gpt3.5 and changed the prompt to Chinese (otherwise the results are poor).
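To inspect these files from Python, a sketch along the following lines may help. It assumes the outputs are Parquet files stored under `output/<run>/artifacts/` inside the project root and that run directory names sort chronologically; this layout is an assumption and can differ between GraphRAG versions. `latest_artifact` and `load_table` are hypothetical helpers, and `load_table` needs pandas with a Parquet engine such as pyarrow installed.

```python
from pathlib import Path


def latest_artifact(root: str, table: str) -> Path:
    """Return the newest <root>/output/<run>/artifacts/<table>.parquet (assumed layout)."""
    candidates = sorted(Path(root).glob(f"output/*/artifacts/{table}.parquet"))
    if not candidates:
        raise FileNotFoundError(f"no {table}.parquet found under {root}/output")
    return candidates[-1]


def load_table(root: str, table: str):
    """Load one output table into a DataFrame."""
    import pandas as pd  # imported lazily so the path helper works without pandas

    return pd.read_parquet(latest_artifact(root, table))


# Example usage after a successful indexing run:
# relationships = load_table("./hlmtest", "create_final_relationships")
# print(relationships.head())
```

Looking at a few rows of the relationships and community reports is a quick way to judge whether the prompt changes improved extraction quality.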
With the pipeline above we have produced intermediate files such as the entity graph, the relationship graph, and the community summaries. Below is question answering based on graphrag.
python -m graphrag.query --root ./hlmtest --method global "What is the relationship between Jia Yuanchun and Jia Baoyu?"
python -m graphrag.query --root ./hlmtest --method global "Who are the Twelve Beauties of Jinling in Dream of Red Mansions?"
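If you want to send several questions from a script, the same command line can be wrapped with `subprocess`. This is a minimal sketch that only relies on the CLI form shown above; `build_query_command` and `run_query` are hypothetical helper names, and `run_query` requires graphrag to be installed and the index to exist.

```python
import subprocess
import sys


def build_query_command(root: str, method: str, question: str) -> list:
    """Assemble the argument list for `python -m graphrag.query`."""
    if method not in {"global", "local"}:
        raise ValueError("method must be 'global' or 'local'")
    return [
        sys.executable, "-m", "graphrag.query",
        "--root", root,
        "--method", method,
        question,
    ]


def run_query(root: str, method: str, question: str) -> str:
    """Run one query and return whatever the CLI prints to stdout."""
    result = subprocess.run(
        build_query_command(root, method, question),
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError if the CLI exits with an error
    )
    return result.stdout


command = build_query_command(
    "./hlmtest", "global", "What is the relationship between Jia Yuanchun and Jia Baoyu?"
)
print(" ".join(command[1:]))
```

Passing the question as a separate list element avoids shell quoting problems, which matters for questions that contain quotes or non-ASCII characters.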
That is the whole process of using GraphRAG for question answering~ Give it a try first. When there is time, we will go through the source code in detail to see how each step is implemented~