Hello, fellow learners. Over the past few days several teachers from other hospitals have asked me how to build their own relational database. Data analysis has become an independent research direction, and building your own database is well worth doing, so here is a walkthrough.

First, open psql and run the following:

```sql
postgres=# CREATE TABLESPACE sicdb_space LOCATION 'E:/pgsql_data';
CREATE TABLESPACE
postgres=# CREATE DATABASE sicdb TABLESPACE sicdb_space;
CREATE DATABASE
```

Then find the matching connection string and run the following in Python:

```python
import glob
import os

import pandas as pd
from sqlalchemy import create_engine, text
from sqlalchemy.exc import SQLAlchemyError

# PostgreSQL database connection string (replace the password with your own)
connection_string = 'postgresql://postgres:3027291@localhost:5432/sicdb'

# Create the engine
engine = create_engine(connection_string)

# Directory where your .csv.gz files are located (update this to your actual path)
directory = 'E:/pgsql_data/salzburg'

# Check if cases.csv.gz exists in the directory
if not glob.glob(f'{directory}/cases.csv.gz'):
    print("File cases.csv.gz not found in the directory.")
    exit()

# Size of each chunk (number of rows); adjust to your system's capabilities
chunk_size = 5000

# Loop through each .csv.gz file in the directory
for filepath in glob.glob(f'{directory}/*.csv.gz'):
    # Use the file name without extension as the table name
    table_name = os.path.basename(filepath).replace('.csv.gz', '')
    print(f"Starting to process {filepath} into {table_name}...")

    chunk_counter = 0
    primary_key = None
    primary_key_set = False

    try:
        # Establish a connection
        conn = engine.connect()
        # Read and insert the CSV file in chunks
        for chunk in pd.read_csv(filepath, compression='gzip', chunksize=chunk_size):
            if primary_key is None:
                primary_key = chunk.columns[0]
                # This basic script does not handle interruptions,
                # so delete the table if it already exists
                conn.execute(text(f'DROP TABLE IF EXISTS {table_name};'))
                conn.commit()

            trans = conn.begin()
            chunk.to_sql(name=table_name, con=conn, if_exists='append', index=False)
            trans.commit()

            # Pandas does not set a primary key. For query-time reasons this
            # cannot be done at the end, so do it after the first chunk.
            if primary_key_set is False:
                print("Setting primary key...")
                conn.execute(text(
                    f'ALTER TABLE {table_name} ADD PRIMARY KEY ("{primary_key}");'))
                conn.commit()
                # Set the most important indices
                important_indices = ["CaseID", "PatientID", "DataID", "LaboratoryID",
                                     "RefID", "DrugID", "Offset", "FieldID"]
                for index in important_indices:
                    if index in chunk.columns and index != primary_key:
                        print(f"Setting {index} as index in {table_name}")
                        conn.execute(text(f'CREATE INDEX ON {table_name} ("{index}");'))
                        conn.commit()
                primary_key_set = True

            chunk_counter += 1
            row_count = chunk_counter * chunk_size
            print(f"Processed chunk {chunk_counter} ({row_count} rows) for {table_name}")

        print(f"Finished adding data to {table_name}")
    except SQLAlchemyError as e:
        print(f"An error occurred: {e}")
    finally:
        # Ensure the connection is closed
        conn.close()
    print(f"Finished processing {filepath}.")
```

How was that? Satisfying, right? Heh.

If you need to start over, look up the tablespace and then drop the database and the tablespace (use DROP TABLESPACE rather than deleting rows from the pg_tablespace system catalog by hand, which PostgreSQL will refuse):

```sql
SELECT oid, spcname FROM pg_tablespace WHERE spcname = 'sicdb_space';
DROP DATABASE IF EXISTS sicdb;
DROP TABLESPACE IF EXISTS sicdb_space;
```

Restart the PostgreSQL service from the operating-system command line:

```sh
net stop postgresql-x64-15
net start postgresql-x64-15
```

Recreate the tablespace and database:

```sql
CREATE TABLESPACE sicdb_space LOCATION 'E:/sicdb_folder';
CREATE DATABASE sicdb TABLESPACE sicdb_space;
```
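The chunked-loading pattern above can be tried without a live PostgreSQL instance. Here is a minimal sketch (the 12-row fixture and temporary path are made up for illustration) that writes a small cases.csv.gz and streams it back in fixed-size chunks, deriving the table name the same way the loader does:

```python
import os
import tempfile

import pandas as pd

# Build a tiny cases.csv.gz fixture (hypothetical data, for illustration only)
tmpdir = tempfile.mkdtemp()
path = os.path.join(tmpdir, 'cases.csv.gz')
pd.DataFrame({'CaseID': range(12), 'Offset': range(12)}).to_csv(
    path, index=False, compression='gzip')

# Stream the file back in fixed-size chunks, as the loader does
chunk_size = 5
total_rows = 0
chunk_no = 0
for chunk_no, chunk in enumerate(
        pd.read_csv(path, compression='gzip', chunksize=chunk_size), start=1):
    total_rows += len(chunk)
    print(f'chunk {chunk_no}: {len(chunk)} rows')

# Derive the table name the same way the loader does
table_name = os.path.basename(path).replace('.csv.gz', '')
print(table_name, total_rows)  # cases 12
```

A larger chunk_size means fewer insert round trips but more memory per chunk, which is why the loader leaves it tunable.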
From: 脑系科数据科学 > 《2024python和R》