retrieve_utils
UNSTRUCTURED_FORMATS
These formats will be parsed by the 'unstructured' library, if installed.
split_text_to_chunks
def split_text_to_chunks(text: str,
max_tokens: int = 4000,
chunk_mode: str = "multi_lines",
must_break_at_empty_line: bool = True,
overlap: int = 0)
Split a long text into chunks of at most max_tokens tokens each.
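The idea of token-bounded chunking can be sketched as follows. This is only an illustration, not the library's implementation: `sketch_split_text_to_chunks` is a hypothetical name, tokens are approximated by whitespace-separated words (an assumption; the real function may count tokens differently), and `chunk_mode`, `must_break_at_empty_line` and `overlap` are left out.

```python
from typing import List


def sketch_split_text_to_chunks(text: str, max_tokens: int = 4000) -> List[str]:
    """Greedy line-based chunker (hypothetical sketch).

    Token counts are approximated by whitespace-separated words.
    """
    chunks: List[str] = []
    current: List[str] = []
    current_tokens = 0
    for line in text.splitlines():
        line_tokens = len(line.split())
        # Close the current chunk when adding this line would exceed the budget.
        if current and current_tokens + line_tokens > max_tokens:
            chunks.append("\n".join(current))
            current, current_tokens = [], 0
        current.append(line)
        current_tokens += line_tokens
    if current:
        chunks.append("\n".join(current))
    return chunks


sample = "one two three\nfour five\nsix seven eight nine"
chunks = sketch_split_text_to_chunks(sample, max_tokens=5)
print(chunks)
```

In this sketch a single line longer than the budget still becomes its own oversized chunk; a fuller version would split such lines further.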
extract_text_from_pdf
def extract_text_from_pdf(file: str) -> str
Extract text from a PDF file.
split_files_to_chunks
def split_files_to_chunks(
files: list,
max_tokens: int = 4000,
chunk_mode: str = "multi_lines",
must_break_at_empty_line: bool = True,
custom_text_split_function: Callable = None
) -> Tuple[List[str], List[dict]]
Split a list of files into chunks of max_tokens.
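A custom splitter for `custom_text_split_function` might look like the sketch below. It assumes the callable receives one string and returns a list of strings; `split_on_blank_lines` is a hypothetical name, not part of the library.

```python
from typing import List


def split_on_blank_lines(text: str) -> List[str]:
    """Hypothetical custom splitter: one chunk per non-empty paragraph."""
    paragraphs = [p.strip() for p in text.split("\n\n")]
    # Drop empty pieces produced by runs of blank lines.
    return [p for p in paragraphs if p]


parts = split_on_blank_lines("a\n\nb\n\n\n\nc")
print(parts)
```

Under that assumption it would be passed as `custom_text_split_function=split_on_blank_lines`. Note that this splitter ignores `max_tokens`, so paragraph length is not bounded.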
get_files_from_dir
def get_files_from_dir(dir_path: Union[str, List[str]],
types: list = TEXT_FORMATS,
recursive: bool = True)
Return a list of all the files found in a given directory, URL, or file path, or in a list of these.
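The directory case can be approximated with the standard library alone. The sketch below is not the library's code: `sketch_list_files` is a hypothetical name, URL handling is omitted, and it assumes `types` holds bare extensions such as "txt".

```python
import tempfile
from pathlib import Path
from typing import List, Sequence


def sketch_list_files(
    dir_path: str,
    types: Sequence[str] = ("txt", "md"),
    recursive: bool = True,
) -> List[str]:
    """Hypothetical sketch: list files under dir_path whose extension is in types."""
    root = Path(dir_path)
    pattern = "**/*" if recursive else "*"
    wanted = {t.lower().lstrip(".") for t in types}
    return sorted(
        str(p)
        for p in root.glob(pattern)
        if p.is_file() and p.suffix.lower().lstrip(".") in wanted
    )


# Build a small throwaway tree to exercise both modes.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "a.txt").write_text("top level")
    (Path(tmp) / "sub").mkdir()
    (Path(tmp) / "sub" / "b.md").write_text("nested")
    (Path(tmp) / "sub" / "c.bin").write_text("ignored")
    found_recursive = [Path(p).name for p in sketch_list_files(tmp)]
    found_flat = [Path(p).name for p in sketch_list_files(tmp, recursive=False)]

print(found_recursive, found_flat)
```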
parse_html_to_markdown
def parse_html_to_markdown(html: str, url: str = None) -> str
Parse HTML to markdown.
get_file_from_url
def get_file_from_url(url: str, save_path: str = None) -> Tuple[str, str]
Download a file from a URL.
is_url
def is_url(string: str)
Return True if the string is a valid URL.
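One common way to perform such a check uses `urllib.parse.urlparse` and requires both a scheme and a network location. The sketch below assumes that definition of "valid"; `sketch_is_url` is a hypothetical name and may differ from the library's behaviour.

```python
from urllib.parse import urlparse


def sketch_is_url(string: str) -> bool:
    """Hypothetical sketch: True when the string has both a scheme and a netloc."""
    try:
        parts = urlparse(string)
    except ValueError:
        # urlparse can raise on malformed input such as an unbalanced IPv6 bracket.
        return False
    return bool(parts.scheme and parts.netloc)


print(sketch_is_url("https://docs.trychroma.com/embeddings"))
print(sketch_is_url("tmp/chromadb.db"))
```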
create_vector_db_from_dir
def create_vector_db_from_dir(dir_path: Union[str, List[str]],
max_tokens: int = 4000,
client: API = None,
db_path: str = "tmp/chromadb.db",
collection_name: str = "all-my-documents",
get_or_create: bool = False,
chunk_mode: str = "multi_lines",
must_break_at_empty_line: bool = True,
embedding_model: str = "all-MiniLM-L6-v2",
embedding_function: Callable = None,
custom_text_split_function: Callable = None,
custom_text_types: List[str] = TEXT_FORMATS,
recursive: bool = True,
extra_docs: bool = False) -> API
Create a vector db from all the files in a given directory. The directory can also be a single file or a URL to a single file. Chromadb-compatible APIs are supported for creating the vector db. This function is not required if you have prepared your own vector db.
Arguments:
dir_path (Union[str, List[str]]) - the path to the directory, file, url or a list of them.
max_tokens (Optional, int) - the maximum number of tokens per chunk. Default is 4000.
client (Optional, API) - the chromadb client. Default is None.
db_path (Optional, str) - the path to the chromadb. Default is "tmp/chromadb.db". The default was /tmp/chromadb.db for version <=0.2.24.
collection_name (Optional, str) - the name of the collection. Default is "all-my-documents".
get_or_create (Optional, bool) - whether to get or create the collection. Default is False. If True, the collection will be returned if it already exists. Will raise ValueError if the collection already exists and get_or_create is False.
chunk_mode (Optional, str) - the chunk mode. Default is "multi_lines".
must_break_at_empty_line (Optional, bool) - whether to break at empty line. Default is True.
embedding_model (Optional, str) - the embedding model to use. Default is "all-MiniLM-L6-v2". Will be ignored if embedding_function is not None.
embedding_function (Optional, Callable) - the embedding function to use. Default is None, in which case SentenceTransformer with the given embedding_model will be used. If you want to use OpenAI, Cohere, HuggingFace or other embedding functions, you can pass it here, following the examples in https://docs.trychroma.com/embeddings.
custom_text_split_function (Optional, Callable) - a custom function to split a string into a list of strings. Default is None, in which case the default function in autogen.retrieve_utils.split_text_to_chunks will be used.
custom_text_types (Optional, List[str]) - a list of file types to be processed. Default is TEXT_FORMATS.
recursive (Optional, bool) - whether to search documents recursively in the dir_path. Default is True.
extra_docs (Optional, bool) - whether to add more documents in the collection. Default is False.
Returns:
The chromadb client.
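The `get_or_create` behaviour described in the arguments can be modelled with a plain dictionary. The sketch below is a toy stand-in, not chromadb and not the library's code; `ToyCollectionStore` and its method are hypothetical names used only to show when the ValueError occurs.

```python
from typing import Dict, List


class ToyCollectionStore:
    """Hypothetical in-memory stand-in for a vector db client; not chromadb."""

    def __init__(self) -> None:
        self._collections: Dict[str, List[str]] = {}

    def create_collection(self, name: str, get_or_create: bool = False) -> List[str]:
        if name in self._collections:
            if get_or_create:
                # Return the existing collection instead of failing.
                return self._collections[name]
            raise ValueError(f"Collection {name!r} already exists")
        self._collections[name] = []
        return self._collections[name]


store = ToyCollectionStore()
first = store.create_collection("all-my-documents")
same = store.create_collection("all-my-documents", get_or_create=True)

try:
    store.create_collection("all-my-documents")  # get_or_create defaults to False
    raised = False
except ValueError:
    raised = True

print(same is first, raised)
```

With the default `get_or_create=False`, calling the function twice against the same collection name fails on the second call, so repeated runs would need either `get_or_create=True` or a fresh collection name.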
query_vector_db
def query_vector_db(query_texts: List[str],
n_results: int = 10,
client: API = None,
db_path: str = "tmp/chromadb.db",
collection_name: str = "all-my-documents",
search_string: str = "",
embedding_model: str = "all-MiniLM-L6-v2",
embedding_function: Callable = None) -> QueryResult
Query a vector db. Chromadb-compatible APIs are supported. This function is not required if you have prepared your own vector db and query function.
Arguments:
query_texts (List[str]) - the list of strings which will be used to query the vector db.
n_results (Optional, int) - the number of results to return. Default is 10.
client (Optional, API) - the chromadb compatible client. Default is None, in which case a chromadb client will be used.
db_path (Optional, str) - the path to the vector db. Default is "tmp/chromadb.db". The default was /tmp/chromadb.db for version <=0.2.24.
collection_name (Optional, str) - the name of the collection. Default is "all-my-documents".
search_string (Optional, str) - the search string. Only docs that contain an exact match of this string will be retrieved. Default is "".
embedding_model (Optional, str) - the embedding model to use. Default is "all-MiniLM-L6-v2". Will be ignored if embedding_function is not None.
embedding_function (Optional, Callable) - the embedding function to use. Default is None, in which case SentenceTransformer with the given embedding_model will be used. If you want to use OpenAI, Cohere, HuggingFace or other embedding functions, you can pass it here, following the examples in https://docs.trychroma.com/embeddings.
Returns:
The query result. The format is:
class QueryResult(TypedDict):
    ids: List[IDs]
    embeddings: Optional[List[List[Embedding]]]
    documents: Optional[List[List[Document]]]
    metadatas: Optional[List[List[Metadata]]]
    distances: Optional[List[List[float]]]
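Reading such a result usually means walking the nested lists in parallel. The sketch below assumes the outer list holds one entry per query text and that `documents` and `distances` are present (both are Optional in the type above). The sample values are made-up placeholders, and `pair_hits` is a hypothetical helper.

```python
from typing import Any, Dict, List, Tuple

# Placeholder data shaped like the QueryResult above; the values are invented
# for illustration and do not come from a real query.
query_texts = ["first question", "second question"]
result: Dict[str, Any] = {
    "ids": [["doc_0", "doc_3"], ["doc_1"]],
    "documents": [["chunk text A", "chunk text B"], ["chunk text C"]],
    "distances": [[0.12, 0.48], [0.30]],
}


def pair_hits(res: Dict[str, Any]) -> List[List[Tuple[str, str, float]]]:
    """Group (id, document, distance) triples per query (hypothetical helper)."""
    return [
        list(zip(ids, docs, dists))
        for ids, docs, dists in zip(res["ids"], res["documents"], res["distances"])
    ]


hits = pair_hits(result)
for query, query_hits in zip(query_texts, hits):
    for doc_id, doc, dist in query_hits:
        print(f"{query!r}: {doc_id} ({dist:.2f}) {doc}")
```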